STA 279 Lab 2
Complete all Questions.
Goals
In our last class, we saw how we can analyze text data from lyrics. Today, we are going to practice this code and process with our headlines dataset from Lab 1.
Data
We will continue working with the same data on n = 2000 article titles from our last lab. To load the data, copy and paste the code below into a chunk and press play.
headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")
Recall that the columns are:
title: the title of the article.
clickbait: a human-generated indicator for whether or not an article is clickbait; FALSE means the article is not clickbait while TRUE means the article is clickbait.
ids: a numeric variable assigned to each article; think of this like an article identifier.
Once you have loaded the data, load the packages you will need for this lab:
library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
# New!!
library(tm)
library(forcats)
As a reminder, anytime R tells you a package cannot be found, it means the package needs to be installed. You can do this from the Tools drop-down menu at the top of your R screen. Let me know if you have any questions!
Clickbait Only: Data Cleaning
We are first going to do some EDA on the clickbait articles. In other words, we want to explore the titles that belong to clickbait articles to see if we can describe traits of these titles. This means that our first step today is to create a dataset that only contains clickbait titles, as headlines has both clickbait and non-clickbait articles.
To create a dataset with only clickbait articles, we want to filter our dataset to contain only the rows in headlines that belong to clickbait titles. We will use the command filter to help us do this.
The filter command keeps only the rows in the dataset that meet the condition inside of the parentheses. In this case, we keep only the clickbait titles, which means only the ones with TRUE in the clickbait column.
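As a rough illustration of how filter behaves, here is a minimal sketch on a small made-up data frame. The name toy_headlines and its three titles are hypothetical stand-ins, not part of the lab data; on the real data, the same pattern applied to headlines and stored as clickbait would presumably give the dataset described in this section.

```r
library(dplyr)

# Hypothetical toy data with the same column names as headlines
toy_headlines <- data.frame(
  title = c("10 Things You Need to Know",
            "Senate Passes Budget Bill",
            "You Won't Believe This Trick"),
  clickbait = c(TRUE, FALSE, TRUE),
  ids = 1:3
)

# Keep only the rows where the clickbait column is TRUE
toy_clickbait <- toy_headlines |>
  filter(clickbait == TRUE)

# Count the rows that survived the filter
nrow(toy_clickbait)
```

Notice that filter changes which rows are kept, while every column is still there.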
Question 1
Create a dataset called notclickbait with only the non-clickbait titles. As an answer to this question, state how many rows are in the dataset you create.
Question 2
We have used the command select in this course already. What is the difference between select and filter?
Hint: If you get stuck, look at our last lab and find one of the places we used select. Compare that usage to the way we just used filter above.
We now have a dataset called clickbait which contains only clickbait titles. The first thing we typically do with this dataset is to tokenize the article titles.
Question 3
Tokenize and store the titles in the clickbait dataset, keeping hyphenated words together. Recall from the last lab that we should use tidy_clickbait as the name of where you store the tokenized titles.
As an answer to this question, state how many rows are in tidy_clickbait.
Clickbait Only: Most Frequent Words
Now that we have tokenized the titles, let’s see if we can determine which words occur most frequently in the clickbait titles. Here is where the fact that tokenizing data in R also results in all words being converted to lowercase is a good thing. Without this, R would count “Today” and “today” as different words. Because we convert all words to lower case, this issue is avoided!
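To see the lowercase conversion in action, here is a small sketch. The data frame toy is hypothetical, and for simplicity the sketch uses the default word tokenizer in unnest_tokens rather than the token = "regex" option used elsewhere in this lab.

```r
library(dplyr)
library(tidytext)

# Hypothetical titles: "Today" and "today" differ only in capitalization
toy <- data.frame(title = c("Today Is the Day", "What to do today"))

# Tokenize (words are lowercased by default) and count each word
toy_counts <- toy |>
  unnest_tokens(word, title) |>
  count(word, sort = TRUE)

# Both spellings are counted together as "today"
toy_counts |>
  filter(word == "today")
```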
In order to find out which words appear most frequently, we need to count the number of times each word appears. We can count the number of times each word appears in the clickbait titles using this code:
clickbait_count <- headlines |>
filter(clickbait == "TRUE") |>
unnest_tokens(word, title, token = "regex") |>
count(word, sort = TRUE)
Question 4
Annotate the code above. In other words, add in comments using # that explain briefly what each line of code does.
This code will create a data frame with 2 columns. The first column (word) tells us one unique word that occurs in clickbait titles. The second column (n) is the number of times that word appears in our clickbait titles.
Question 5
How many unique words are there in total in the clickbait titles?
Hint: You do not need to write any code to figure this out!
Question 6
Which word appears the most often in the clickbait titles, and how often does this word appear?
Stop Words
You will likely notice something about the words that have the highest counts in the clickbait titles. We see words like “the” and “to”. These are all very common words in the English language, but we were hoping to find some words that could distinguish clickbait titles. For this goal, words like “the” are not helpful to us.
We call words that are needed for grammatical reasons, but that do not add to the content, stop words. You will find that we use a lot of stop words in English, including “the”, “it”, “a”, “as”, “an”, and so on.
If you are interested in what other words are considered stop words, you can use the following code to explore:
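One possible way to explore is sketched below. It assumes only that the tidytext package is installed; stop_words is a dataset that ships with tidytext, with one column for the word and one for the lexicon (word list) it came from.

```r
library(tidytext)

# Load the stop word list that comes with tidytext
data(stop_words)

# Look at the first few rows
head(stop_words)

# See which lexicons the words come from and how many words each contributes
table(stop_words$lexicon)

# Check whether a particular word is on the list
"the" %in% stop_words$word
```

You can also click on stop_words in the Environment pane to scroll through the whole list.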
Question 7
How many words are there in the stop_words dataset?
Let’s take a look at the 13th title in the dataset:
## [1] "Judge Guilty in Kickbacks Is Accused of Fixing Suit"
Question 8
Which words do you think could be removed from the title without losing the key content? In other words, which words do you think are stop words?
Since stop words do not actually add to the content, we often consider removing stop words before we conduct any analysis on text data. As we have already seen, if we do not remove stop words they often dominate the list of most frequent words, making it difficult to see which words might be useful features to differentiate clickbait and non-clickbait titles.
To (1) tokenize the 13th title and (2) remove the stop words, we use the following:
# Start with the 13th title
headlines[13,] |>
# Break the title into words
unnest_tokens(word, title, token = "regex") |> # and then
# Remove all stop words
anti_join(stop_words, by = "word")
Question 9
Why do you think we remove the stop words as the last step in this code? In other words, why does it come after tokenizing rather than before?
Question 10
When we counted the most frequent words in clickbait titles, we used:
clickbait_count <- headlines |>
filter(clickbait == "TRUE") |>
unnest_tokens(word, title, token = "regex") |>
count(word, sort = TRUE)
Adapt the code above to remove the stop words. As the answer to this question, state how many unique non-stop words are in the clickbait titles.
Considering Pronouns
In our first class, we talked about this dataset and considered some key features that might help us distinguish clickbait titles from non-clickbait titles. One of the things we noticed was that pronouns like “you” were potentially helpful to look at.
However…when we remove stop words the way we have, we also remove pronouns. For our lab today, that’s probably something we do not want.
To tell R to remove all stop words EXCEPT the word “you”, we can remove “you” from the list of stop words using the following code:
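A minimal sketch of one way to do this is below. It assumes we overwrite stop_words with a filtered copy; this is an illustration of the idea and may differ in detail from the line used in class.

```r
library(dplyr)
library(tidytext)

# Keep every stop word EXCEPT "you", and save the result back into stop_words
stop_words <- stop_words |>
  filter(word != "you")

# "you" is no longer on the list, but words like "the" still are
"you" %in% stop_words$word
"the" %in% stop_words$word
```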
We can then use anti_join(stop_words) like usual to remove the stop words, while keeping “you”.
Question 11
Adapt the line of code above to remove “we” from the list of stop words. Show your code.
Question 12
When we counted the most frequent words in clickbait titles, we used:
clickbait_count <- headlines |>
filter(clickbait == "TRUE") |>
unnest_tokens(word, title, token = "regex") |>
count(word, sort = TRUE)
Adapt the code above to remove the stop words EXCEPT “you” and “we”. As the answer to this question, state how many unique non-stop words are in the clickbait titles.
Visualizing the Most Frequent Words
Before we got off on our tangent about stop words, we were interested in looking at the most frequent words in clickbait articles (excluding stop words!). We are going to visualize the most frequent words in clickbait articles using (1) bar plots and (2) word clouds.
Bar Plots
A bar plot (or bar graph) is a familiar graph, and it turns out it is very useful when we want to visualize the most frequent words in a piece of text. Bar plots allow us to look at a categorical variable (in this case a word) and plot how many times that variable (word) appears in a data set.
To create plots in R, we will use the ggplot2 package. This allows us to make very pretty, very professional plots that we can customize.
The basic structure of a bar plot is
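Here is a minimal sketch of that structure on a tiny made-up table of counts. The data frame toy_counts, the use of geom_col, and the label text are all assumptions made for illustration, so treat this as one reasonable template rather than the only way to write it.

```r
library(ggplot2)

# Hypothetical word counts, in the same shape that count() produces
toy_counts <- data.frame(
  word = c("you", "things", "people"),
  n = c(5, 3, 2)
)

# Counts go on the x axis, words on the y axis
toy_plot <- ggplot(toy_counts, aes(x = n, y = word)) +
  # fill is the color inside the bars; color is the outline
  geom_col(fill = "white", color = "black") +
  # Always label your axes and give the plot a title
  labs(x = "Count", y = "Word", title = "Toy word counts")

toy_plot
```

To use this with your own data, swap toy_counts for your dataset of word counts and change fill to the color you want.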
Question 13
Adapt the code above to plot the number of times each non-stop word appears in the clickbait titles, making the bars a color (like “blue”, “purple”, “green”, etc.)
NOTE: The plot will NOT be pretty! That’s okay, we’ll make it better!!
Question 14
The plot we just created has several issues. Tell me at least two of them!
Okay, so that’s a plot…but it’s not useful. There are a few issues with it, and we can adapt it to make it better.
One problem is that there are just too many words to be shown effectively on a plot. We don’t really want to see every word - our goal is to see words that characterize clickbait, which means we want to see words that occur frequently. One way to do this is to visualize the top 20, 15, 10, etc., most frequent words, rather than all the words.
To find the 15 words that appear most often in clickbait titles, we slice our data to keep only the top 15 most frequent words. We use the slice_max command to do this.
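As a small sketch of slice_max, the example below keeps the top 3 rows of a made-up table of counts. The data frame toy_counts is hypothetical; with the lab data you would use n = 15 instead of n = 3.

```r
library(dplyr)

# Hypothetical word counts
toy_counts <- data.frame(
  word = c("you", "things", "people", "life", "love"),
  n = c(9, 7, 4, 2, 1)
)

# Keep the 3 rows with the largest values in the column n
top_words <- toy_counts |>
  slice_max(n, n = 3)

top_words
```

One thing to be aware of: by default slice_max keeps ties, so if several words share the same count at the cutoff you can get back more rows than you asked for.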
Here, it gets a little confusing because n is the name of the column that holds the count. However, the function also uses n = 15 to communicate how many words we want to choose, which in this case is 15.
As a side note, basically whenever you use the count function, the result will be in a column called n. Because we use n for a lot of things in statistics, you can always rename the column if you’d like!
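If you do want a different column name, here are two possible approaches, sketched on a made-up set of words. The data frame toy_words and the new name frequency are hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical tokenized words, one per row
toy_words <- data.frame(
  word = c("you", "things", "you", "you", "things", "life")
)

# Option 1: tell count() what to call the column
counts_named <- toy_words |>
  count(word, sort = TRUE, name = "frequency")

# Option 2: count first, then rename the n column afterwards
counts_renamed <- toy_words |>
  count(word, sort = TRUE) |>
  rename(frequency = n)

counts_named
```

If you rename the column, remember to use the new name everywhere else in your code, including inside slice_max and aes.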
Question 15
Go back to your code from Question 12 and add slice_max(n, n = 15) to the end of it. Then, re-run your code from Question 13 and show your new bar plot. What issues still seem to be occurring in the bar plot?
To fix the final issue, we need to adapt the aes part of the code. Where we have y = word, we need y = fct_reorder(word, n). This code tells R to please order the bars so that the words with the highest frequency (biggest n) are on top.
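To see what fct_reorder is doing, here is a short sketch on a made-up table of counts. The data frame toy_counts and the plot styling are hypothetical; only the fct_reorder(word, n) piece comes from the text above.

```r
library(ggplot2)
library(forcats)

# Hypothetical word counts, deliberately NOT sorted by n
toy_counts <- data.frame(
  word = c("things", "you", "people"),
  n = c(3, 5, 2)
)

# fct_reorder sorts the words from smallest n to largest n
word_order <- levels(fct_reorder(toy_counts$word, toy_counts$n))
word_order

# On the y axis the first level is drawn at the bottom,
# so the word with the biggest n ends up on top
ordered_plot <- ggplot(toy_counts, aes(x = n, y = fct_reorder(word, n))) +
  geom_col(fill = "purple") +
  labs(x = "Count", y = "Word")

ordered_plot
```

Without the labs line, the y axis label would read fct_reorder(word, n), which is a good reason to always check your labels.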
Question 16
Adapt your code from Question 15 so the bars are ordered from largest count to smallest count. Show your graph!
If you create the bar plot using the code above, you will notice something a little odd. Some of our top “words” in the clickbait article titles are numbers! Does this happen with the non-clickbait titles??
Question 17
Create a bar graph for the top 15 words in the non-clickbait titles. Color the bars anything other than white, gray, black, or blue, and make sure to check your labels!!
From our analysis so far, we can see that having numbers in the title might be a good feature to use to identify whether or not a title is clickbait. Great!!
However, what if we would actually like to look at the top words (not numbers) in the titles? If we want to remove the numbers, we can add a line to our process of tokenizing:
# Start with the headlines dataset
tidy_clickbaitonly <- headlines |>
# Choose only the clickbait titles
filter(clickbait == "TRUE") |>
# Tokenize the titles
unnest_tokens(word, title, token = "regex") |>
# Remove the stop words
anti_join(stop_words) |>
# Remove the numbers
filter(!grepl('[0-9]', word))
Adding all the comments with # as I have done in the code above is called annotating your code. This leaves notes for yourself and for others that explain clearly what each line of the code is doing. When we write professional code, it is generally annotated.
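Going back to the number-removal step for a moment, the short sketch below shows what grepl('[0-9]', word) does on a few made-up tokens. The vector toy_tokens is hypothetical.

```r
# Hypothetical tokens, some containing digits
toy_tokens <- c("10", "things", "2016", "top-10", "you")

# TRUE for any token that contains at least one digit
has_digit <- grepl("[0-9]", toy_tokens)
has_digit

# The ! flips TRUE and FALSE, so we keep only the tokens with no digits
toy_tokens[!has_digit]
```

Note that a token like "top-10" is dropped too, because the pattern matches any token with a digit anywhere in it, not just tokens that are entirely numbers.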
Question 18
Create a new bar plot of the top 10 words in the clickbait titles, after the numbers have been removed. Show the plot.
Question 19
Based on the bar graphs after removing numbers, comment on the difference between the words that seem to appear in clickbait titles versus non-clickbait titles.
Word Clouds
Another way we can visualize the most frequent words is by using a word cloud. A word cloud is a type of plot specific to text data that will show the most popular words in a dataset. The larger the size of a word in the plot, the more frequently that word appears in the dataset. This allows us to quickly compare the top few words.
Let’s start off by creating a word cloud with only the top 10 words in the clickbait dataset.
# Load the library
library(wordcloud)
set.seed(279)
# Make the plot
tidy_clickbaitonly |>
count(word, sort = TRUE) |>
with(wordcloud(word, n, max.words = 10))
The max.words part of the code specifies how many words will appear in the word cloud. If you want the top 15 words, for example, you use max.words = 15.
Question 20
Create a word cloud for the top 20 words in the non-clickbait titles (after removing stop words).
As a side note, you can format your word clouds with different color palettes. For example:
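Here is a sketch of one way to add color, using a made-up table of counts. The data frame toy_counts and the choice of the "Dark2" palette are assumptions for illustration; brewer.pal comes from the RColorBrewer package, and colors is an argument of wordcloud.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical word counts
toy_counts <- data.frame(
  word = c("you", "things", "people", "life", "love"),
  n = c(9, 7, 4, 2, 1)
)

# Choose a palette of 8 colors
cloud_colors <- brewer.pal(8, "Dark2")

# Set a seed so the layout is the same each time
set.seed(279)

# min.freq = 1 keeps rare words in this tiny example
with(toy_counts,
     wordcloud(word, n, min.freq = 1, max.words = 5, colors = cloud_colors))
```

With your own data, you would pipe your counted words into with(...) as in the earlier word cloud code and add the colors argument.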
References
Data
The dataset used in this lab is the sample_headlines dataset downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Citation: Horton, Nicholas J. Text Classification Examples, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Code
The code used in this lab is adapted from https://www.tidytextmining.com/ for STA 279. This adaptation is not endorsed by the original authors. Citation: Silge, Julia and Robinson, David. Text Mining with R: A Tidy Approach, Last built on 2024-08-13. Source: https://www.tidytextmining.com/