STA 279 Lab 1
Complete all Questions.
The Goal
Today, we are going to get started learning the code that will help us work with text data in R. A lot of the work in using text data comes in being able to manipulate what you have (a string of words and phrases) into what you need (usable content for drawing conclusions and building models). Today, we are going to start to explore the coding structure we will use to do this.
The primary goals will be (1) getting started with the tidyr and tidytext packages and (2) using these tools to tokenize phrases. Tokenizing is the first step to a lot of things in text analysis!
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Clearing the Workspace
You learned R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel and find an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
Installing Packages
Once we have opened up an RMarkdown file, the first thing we need to do is load the packages we need to run our analysis. A package is a collection of related functions. When you load a package into R, you give R access to all of the functions within the package. A package has to be installed before it can be loaded, so our first step is to remind ourselves how to install an R package.
To install a package:
- Look at the top of your RStudio screen and find “Tools”.
- From the drop down menu, choose “Install Packages”.
- In the white box, type dplyr, and then hit Install.
We actually need a few packages for today: tidytext, tidyr, dplyr, and ggplot2. We will need these packages for essentially everything we do in this course. Luckily, you only have to install each package once!
Go ahead and install all 4 packages.
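If you prefer typing to clicking, the same installation can also be done from the R console. The sketch below is one possible way to do it and is not part of the lab steps above: it checks which of the four packages are not installed yet, and the install line is left commented out so nothing is downloaded until you choose to run it. The names needed, is_installed, and not_installed are just illustrative.

```r
# The four packages this lab asks for
needed <- c("tidytext", "tidyr", "dplyr", "ggplot2")

# requireNamespace() gives TRUE if a package is installed and can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the packages that still need installing
not_installed <- needed[!is_installed]
print(not_installed)

# Uncomment to install whatever is missing (needs an internet connection)
# install.packages(not_installed)
```

If not_installed prints as empty, all four packages are already on your computer.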
Loading Packages
Now that we have the packages installed, we need to tell R to actually use them. To do that, we need to create a space in our RMarkdown file to write code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
There are two ways to tell RMarkdown we are about to give it code.
- Option 1: Look at the top of your Markdown file, and find Code. Click it. From the drop-down menu, choose Insert Chunk. Click it! A gray box should appear in your Markdown file. This is a chunk.
- Option 2: In the top gray bar of your Markdown file, look for a small green C (shown below). Click that, and choose R! A gray box should appear in your Markdown file. This is a chunk.
When you are done, you should have a code chunk that looks like this:
Anything we put inside a chunk (that gray box) will be treated as computer code. Go ahead and put the code below inside a chunk.
suppressMessages( library(tidytext) )
suppressMessages( library(tidyr) )
suppressMessages( library(dplyr) )
suppressMessages( library(ggplot2) )
This means you should have something like this:
Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play). This tells R to run the code!
When you run this code, it will look like nothing happens…and that’s because all we told R to do was get ready to use these packages. With such a command, nothing will show up on our screen, and that’s okay!
Note: In the code above, you’ll notice I have suppressMessages around each library. This is not strictly necessary, but if you don’t include it, R prints out messages / comments about the libraries when you load them in, for example: Attaching package: dplyr. This is annoying when we want to have professionally formatted files, so suppressMessages just tells R we don’t want to see those.
Loading the Data
Now that we have the libraries loaded, we need the data. In our lab today, we are going to continue working with article titles as we did in the first class. We will work with \(n= 2000\) titles, and to load the data, copy and paste the following into a chunk in R and press play:
headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")
The code above will load a data set called headlines with \(n = 2000\) rows and 3 columns into R. The columns are:
- title: the title (headline) of the article
- clickbait: a human generated indicator for whether or not an article is clickbait; FALSE means it is not clickbait and TRUE means it is clickbait.
- ids: a number assigned to each article; think of this like an article identifier.
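After loading, it can be worth confirming that the data looks the way the description says. The sketch below uses a tiny made-up data frame, toy_headlines, as a stand-in (its titles are invented for illustration and do not come from the lab's data). With the real data, you would put headlines in its place, and dim should report 2000 rows and 3 columns.

```r
# A tiny stand-in with the same three column names as the lab's data.
# The titles below are invented for illustration only.
toy_headlines <- data.frame(
  title = c("ten tricks you will not believe",
            "city council approves new budget",
            "well-known author wins prize"),
  clickbait = c(TRUE, FALSE, FALSE),
  ids = 1:3
)

# Number of rows, then number of columns
dim(toy_headlines)

# Column names
names(toy_headlines)

# First two rows
head(toy_headlines, 2)
```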
Now that we have data, we are ready to use it to start answering our lab questions! We will use Markdown (or Quarto if you prefer) files in our course to submit labs. When you are answering a lab question, your set up should look like this:
The ## allows your Question numbers to show up in bold so I can easily find them for grading. The * puts your text in italics. Let’s try it!
Question 1
The variable clickbait is an indicator variable. Remind me - what is an indicator variable? Hint: This can also be called a dummy variable.
We are working with article titles today primarily because they are short! It means we can look at each word in the text and make sure our code is doing what we think it does. However, we will work with all kinds of data sets in this course! We will analyze articles, books, Amazon reviews, song lyrics, and more.
All of this is to say we are starting with clickbait and article titles, but this won’t be the whole course!
We are going to start to explore text data by focusing on only the first row in the data set. To print out the first row so we can see it, run the following code:
In this code, headlines is the name of the data set and the [1, ] part of the code tells R to print the first row.
This means that to print a row in R, we use the format:
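A minimal sketch of this format, assuming the code described above is headlines[1, ]. The example runs on a small made-up data frame (toy_headlines and its titles are invented for illustration), so it can be tried without the lab's data.

```r
# A tiny stand-in for the lab's data; the titles are invented
toy_headlines <- data.frame(
  title = c("ten tricks you will not believe",
            "city council approves new budget",
            "well-known author wins prize"),
  clickbait = c(TRUE, FALSE, FALSE),
  ids = 1:3
)

# General format: datasetname[rownumber, ]
# Leaving the spot after the comma blank means "keep every column"
first_row <- toy_headlines[1, ]
first_row

# With the lab's data, the first row would be printed with:
# headlines[1, ]
```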
Question 2
Run the code needed to print out the 3rd row in the headlines data set.
Let’s go back to the first title.
Question 3
Based only on the title, do you think the first article is clickbait or not? Clearly justify your reasoning in 1-2 sentences.
Note: There is no one right answer here, I just want to see how you thought this through!
Regardless of how you answered Question 3, you probably mentioned something that had to do with the content, meaning the words, in the title. We could be interested in looking for specific words, or certain types of words, or even how many words are present.
All of this means that a key part of working with text data is to break down text into words. This process is called tokenizing.
Tokenizing
Tokenizing means breaking text down into smaller pieces. Let’s start by breaking up text into individual words.
Question 4
If we tokenize the first title in our data set, what would the result look like?
Note: I don’t want you to code anything yet! Just show me what it would look like to tokenize the title by hand.
This would be very tedious if we had to do it by hand for all 2000 titles! Luckily, tokenizing the first title into individual words can be done using a short piece of code:
Some of us may know this code structure, but for many of us, we are looking at that code going “WHAT?”. What on earth is the |> thing??
Well, the symbol |> you see in the code above is called a pipe. We will be using it a lot in this course. Basically, you can translate this symbol in your head into the phrase “and then”.
This means that the code structure you see above means this:
In this case, the code takes the first title in the data set and tokenizes it into individual words.
# Start with the first row in the data set
headlines[1, ] |> # AND THEN
  # Break the first title down into words
  unnest_tokens(word, title)
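If the pipe still feels abstract, it may help to see it on plain numbers first. This small sketch is not from the lab's data. It only illustrates that x |> f() is another way of writing f(x), so a chain of pipes reads from top to bottom as “and then”. It assumes R version 4.1 or newer, where |> is available; the names piped and nested are just illustrative.

```r
# Start with three numbers
piped <- c(4, 9, 16) |>  # AND THEN
  sqrt() |>              # take the square root of each one, AND THEN
  sum()                  # add them up
piped

# The same calculation without the pipe (read from the inside out)
nested <- sum(sqrt(c(4, 9, 16)))
nested
```

Both versions give the same answer. The piped version just lets us read the steps in the order they happen.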
Question 5
Write and run the code needed to tokenize the 12th title (headline) in the data set.
The unnest_tokens function takes two inputs.
- The first input (word) is the name we want to give to the column where the tokenized text will show up in our output. This means you can use Word or words or something similar if you like.
- The second input (title) tells us the column in the data set we want to tokenize. In other words, this tells R where to look in the data set to find the text we want to tokenize.
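To make the two inputs concrete, here is a minimal sketch on a made-up one-row data frame (toy_review, its column names, and its text are invented for illustration and are not part of the lab's data). The column holding the text is called text, and we ask for the new column of tokens to be called term.

```r
suppressMessages( library(tidytext) )

# A made-up one-row data set with a text column
toy_review <- data.frame(
  stars = 5,
  text = "great phone and great battery"
)

# First input (term): the name we choose for the new column of tokens
# Second input (text): the existing column that holds the text
toy_tokens <- toy_review |>
  unnest_tokens(term, text)

toy_tokens
```

Notice that the text column is replaced by the term column, and the stars value is repeated on every row.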
Question 6
Adapt your code from Question 5 so the name of the 3rd column is tokens.
You will notice that the code in Question 6 prints out 3 columns. The first two columns (clickbait and ids) are the same for every row in the output, because all the words come from just one title. Basically, the unnest_tokens code takes a single row of data and creates multiple rows out of it. The only thing that changes from row to row is the text: each row contains one word of the original text.
This can be fine, but we don’t really care about the clickbait or ids columns right now. We only want to see the words. We can get only the words by adding one more command to our code:
# Start with the first row in the data set
headlines[1, ] |> # AND THEN
  # Break the first title down into words
  unnest_tokens(word, title) |> # AND THEN
  # Print out only the column called word
  select(word)
And right here, this is the reason why we use pipes (|>). We can keep adding to our commands - do this, and then do this, and then do this. This allows us to perform more complicated coding without having a lot of messy lines of code.
The select command in R tells R to choose only certain columns, and ignore the rest.
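As a small sketch of select on its own, here is a made-up data frame (toy_scores and its values are invented for illustration). Listing more than one column name keeps all of the listed columns.

```r
suppressMessages( library(dplyr) )

# A made-up data set with three columns
toy_scores <- data.frame(
  name = c("Ana", "Ben"),
  lab1 = c(9, 8),
  lab2 = c(10, 7)
)

# Keep only the name column
toy_scores |>
  select(name)

# Keep two columns and ignore the rest
name_and_lab2 <- toy_scores |>
  select(name, lab2)
name_and_lab2
```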
Question 7
Run the code needed to tokenize the 12th title (headline) in the data set (calling the 3rd column word) but only print out the word column.
Looking at Tokenized Output
Okay, so now we know how to tokenize the first title in R.
# Start with the first row in the data set
headlines[1, ] |> # AND THEN
  # Break the first title down into words
  unnest_tokens(word, title) |> # AND THEN
  # Print out only the column called word
  select(word)
However…let’s look at what we get when we do this.
Question 8
Take a look at the result of the first title, tokenized. This command does something to the title besides tokenize it. What is it?
Hint: If you get stuck, look at what the first title looked like before you tokenized it!
What you found in Question 8 will always happen when you use the unnest_tokens function. There is a way to stop it from doing that if we don’t want it to, but we’ll learn more about that later in the course!
Question 9
Why might this extra step performed by the unnest_tokens function be useful when we analyze text data?
Note: There is more than one correct answer to this!
Question 10
Take another look at the result of the first title, tokenized. Something in it should look not quite right. What is it?
Your answer to Question 10 highlights a key thing we are going to come up against when working with text data - data cleaning.
Text data is messy! We have punctuation marks, mis-spellings, emojis, numbers, all kinds of things that can pop up in text. We may be interested in looking at some of these things, but we may also need to remove them before we perform an analysis. Getting the text data ready for analysis is called cleaning the text.
In this case, the issue we are running into is a hyphen (-). The phrase “re-elected” is hyphenated, and this is causing R to think that “re” and “elected” are two different words. To fix this, we need to add one thing to the end of the 2nd line of code.
# Start with the first row in the data set
headlines[1, ] |> # AND THEN
  # Break the first title down into words
  unnest_tokens(word, title, token = "regex") |> # AND THEN
  # Print out only the column called word
  select(word)
Question 11
What is new in the code above? In other words, what did we add to the code to fix the issue with the hyphen?
We won’t always need this little addition to the code, but we often do. We also often need other little things to help clean up the code. We will see more as we move through the course!
Question 12
Take a look at the 2nd title in the data set. Tokenize the text (1) without and (2) with the code addition that handles punctuation.
What differences do you see in the output?
In this case, for an analysis, how do you think we should handle punctuation? Note: There is no right answer here, I just want to see how you are thinking!
Storing Things in R
Okay, so now we can tokenize text. Great! However, if we want to actually use that data for anything later, it helps to be able to save the results somewhere for safe keeping. We do this by storing the results.
For example, to tokenize the first title, we know that we use:
To store the tokenized text of the first title rather than printing it out, we add just one more step to the code:
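A sketch of both versions, assuming the stored version is simply the tokenizing code from earlier with tidy_headlines1 <- placed in front of it. The example runs on a made-up one-row data frame (toy_headlines, tidy_toy1, and the title are invented for illustration), so it can be tried without the lab's data.

```r
suppressMessages( library(tidytext) )

# A made-up one-row stand-in for the lab's data
toy_headlines <- data.frame(
  title = "city council approves new budget",
  clickbait = FALSE,
  ids = 1
)

# Printing: the tokens show up on screen but are not saved anywhere
toy_headlines[1, ] |>
  unnest_tokens(word, title)

# Storing: the same code, with a name and <- in front
tidy_toy1 <- toy_headlines[1, ] |>
  unnest_tokens(word, title)

# With the lab's data, this would presumably be:
# tidy_headlines1 <- headlines[1, ] |>
#   unnest_tokens(word, title)
```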
What does this code do? Well, take a look at your upper right hand panel in RStudio and you will now see a new data set called tidy_headlines1 that contains the tokenized version of the first title!
So, to store things in R, all we need to do is choose the name we want the results to be stored under. I chose tidy_headlines1 to be clear what I’m storing: the tokenized version of the first title. We then use the <- sign to tell R that we want to store the tokenized text under this name. Think of this code as:
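One rough way to read it, shown here as a small sketch with a made-up name (total_points is invented for illustration): whatever is on the right of the arrow gets saved under the name on the left.

```r
# name to store under  <-  thing we want to store
total_points <- 9 + 8 + 10

# Nothing prints when we store; type the name to see what was saved
total_points
```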
NOTE: We will use the structure tidy_datasetname whenever we tokenize text in this course. This makes it easy when you are looking at a data set in R to know exactly what is in there. If you see tidy_datasetname, you know you are looking at a tokenized version of datasetname.
Question 13
Suppose we tokenize all the titles in the headlines data set. What name would we store these tokenized titles under?
As hinted at in the previous question, it turns out we can tokenize more than one title at once! For example, to tokenize the first two titles, we can use this code:
The 1:2 part of the code tells R to grab all rows starting at 1 and ending at 2 (so rows 1 to 2).
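A small sketch of what the colon does, assuming the code described above is the earlier tokenizing code with headlines[1:2, ] in place of headlines[1, ]. The example uses a made-up data frame (toy_headlines and its titles are invented for illustration).

```r
# 1:2 is shorthand for the whole numbers from 1 up to 2
1:2

# A tiny stand-in for the lab's data; the titles are invented
toy_headlines <- data.frame(
  title = c("ten tricks you will not believe",
            "city council approves new budget",
            "well-known author wins prize"),
  clickbait = c(TRUE, FALSE, FALSE),
  ids = 1:3
)

# Rows 1 through 2, all columns
first_two <- toy_headlines[1:2, ]
first_two
```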
Question 14
Write and run the code needed to tokenize and store the first 20 titles in the data set, keeping hyphenated words together. Show your code, but do NOT print all the tokenized titles in your file!!
Recap so far
Let’s pause for a minute, because this seems like a lot of code. It is, and it will feel that way in the beginning as we get used to it! However, we will use the same coding structure over and over again in this course, so it will start to feel more familiar soon! This is also why we have a lot of labs early in the semester so that we can practice. If you ever feel stuck, let me know!!
So far, we have seen the following code:
To Tokenize Text
To Tokenize Text (keeping hyphenated words together)
To Store Tokenized Text
To Print a row in a dataset
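As a hedged sketch based on the code shown earlier in this lab, the four patterns listed above look roughly like this. The sketch runs on a made-up data frame (toy_headlines, tidy_toy, tidy_toy_hyphen, and the titles are invented for illustration). In the lab, headlines takes the place of toy_headlines.

```r
suppressMessages( library(tidytext) )

# A tiny stand-in for the lab's data; the titles are invented
toy_headlines <- data.frame(
  title = c("city council approves new budget",
            "well-known author wins prize"),
  clickbait = c(FALSE, FALSE),
  ids = 1:2
)

# To tokenize text
toy_headlines[1, ] |>
  unnest_tokens(word, title)

# To tokenize text (keeping hyphenated words together)
tidy_toy_hyphen <- toy_headlines[2, ] |>
  unnest_tokens(word, title, token = "regex")
tidy_toy_hyphen

# To store tokenized text
tidy_toy <- toy_headlines[1, ] |>
  unnest_tokens(word, title)

# To print a row in a data set
toy_headlines[1, ]
```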
Before you submit
Three last steps before we knit, and then you will be done with Lab 1!
- Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
- If you are working with a partner, make sure both of your names are at the top of the file.
- You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.
Once you’ve done this, knit your file. This will create the PDF or html you need to submit. If you get stuck, let Dr. Dalzell know!
References
The data set used in this lab is the sample_headlines data set downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Data Citation: Horton, Nicholas J. Text Classification Examples, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.