STA 279 Lab 1
Complete all Questions.
The Goal
All of us used R in STA 112, but the code we need for text will be a little different. Today, we are going to warm up by reminding ourselves about coding in R and getting started with the code that will help us work with text data in R.
A lot of the work in using text data comes in being able to manipulate what you have (a string of words and phrases) into what you need (usable content for drawing conclusions and building models). Today, we are going to start to explore the coding structure we will use to do this.
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Getting Started
Clearing the Work space
You learned R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel and find an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
Installing Packages
Once we have opened up an RMarkdown file, the first thing we need to do is load the packages we need to run our analysis. A package is a collection of related code. When you load a package into R, you give R access to all of the functions within the package. Before a package can be loaded it has to be installed, so our first step is to remind ourselves how to install an R package.
To install a package:
- Look at the top of your RStudio screen and find “Tools”.
- From the drop down menu, choose “Install Packages”.
- In the white box, type dplyr, and then hit install.
We actually need a few packages for today: tidytext,
tidyr, dplyr, and ggplot2. We
will need these packages for essentially everything we do in this
course. Luckily you only have to install each package once!!!
Go ahead and install all 4 packages.
Loading Packages
Now that we have the package installed, we need to tell R to actually use that package. To do that, we need to create a space in our RMarkdown file to create code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
To tell RMarkdown we are about to give it code, there are two ways to do it.
- Option 1: Look at the top of your Markdown file, and find Code. Click it. From the drop-down menu, choose Insert Chunk. Click it! A gray box should appear in your Markdown file. This is a chunk.
- Option 2: In the top gray bar of your Markdown file, look for a small green C (shown below). Click that, and choose R! A gray box should appear in your Markdown file. This is a chunk.
When you are done, you should have a code chunk that looks like this:
Anything we put inside a chunk (that gray box) will be treated as computer code. Go ahead and put the code below inside a chunk.
suppressMessages( library(tidytext) )
suppressMessages( library(tidyr) )
suppressMessages( library(dplyr) )
suppressMessages( library(ggplot2) )
This means you should have something like this:
Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play). This tells R to run the code!
When you run this code, it will look like nothing happens…and that’s because all we told R to do was get ready to use these packages. With such a command, nothing will show up on our screen, and that’s okay!
Note: In the code above, you’ll notice I have
suppressMessages around each library. This is not strictly
necessary, but if you don’t include it, R prints out messages / comments
about the libraries when you load them in, for example:
Attaching package: dplyr. This is annoying when we want to
have professionally formatted files, so suppressMessages
just tells R we don’t want to see those.
Loading the Data
Now that we have the libraries loaded, we need the data. In our lab today, we are going to continue working with article titles as we did in the first class. We will work with \(n= 2000\) titles, and to load the data, copy and paste the following into a chunk in R and press play:
headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")
The code above will load a data set called headlines with \(n = 2000\) rows and 3 columns into R. The columns are:
- title: the title (headline) of the article
- clickbait: a human generated indicator for whether or not an article is clickbait; FALSE means it is not clickbait and TRUE means it is clickbait.
- ids: a number assigned to each article; think of this like an article identifier.
Now that we have data, we are ready to use it to start answering our lab questions! We will use Markdown (or Quarto if you prefer) files in our course to submit labs. When you are answering a lab question, your set up should look like this:
The ## allows your Question numbers to show up in bold
so I can easily find them for grading. The * puts your text
in italics. Let’s try it!
Question 1
The variable clickbait is an indicator variable. Remind
me - what is an indicator variable? Hint: This can also be
called a dummy variable.
We are working with article titles today primarily because they are short! It means we can look at each word in the text and make sure our code is doing what we think it does. However, we will work with all kinds of data sets in this course! We will analyze articles, books, Amazon reviews, song lyrics, and more.
All of this is to say we are starting with clickbait and article titles, but this won’t be the whole course!
We are going to start to explore text data by focusing on only the first row in the data set. To print out the first row so we can see it, run the following code:
In this code, headlines is the name of the data set and
the [1, ] part of the code tells R to print the first row.
This means that to print a row in R, we use the format:
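As a minimal sketch of the dataset[row, ] format, the chunk below builds a tiny made-up data frame and prints its first row. The name toy_headlines and its two titles are hypothetical stand-ins, not the lab data; with the real data the same pattern would be headlines[1, ].

```r
# Hypothetical toy data with the same three columns as headlines
toy_headlines <- data.frame(
  title = c("Ten Cats Who Love Mondays", "Senate Passes Budget Bill"),
  clickbait = c(TRUE, FALSE),
  ids = c(1, 2),
  stringsAsFactors = FALSE
)

# The row number goes before the comma.
# Leaving the spot after the comma empty keeps every column.
first_row <- toy_headlines[1, ]
first_row
```

Changing the number before the comma changes which row is printed.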
Question 2
Run the code needed to print out the 3rd row in the
headlines data set.
Let’s look at the 112th title.
Question 3
Based only on the title, do you think the article (112) is clickbait or not? Clearly justify your reasoning in 1-2 sentences.
Note: There is no one right answer here, I just want to see how you thought this through!
Regardless of how you answered Question 3, you probably mentioned something that had to do with the content, meaning the words, in the title. We could be interested in looking for specific words, or certain types of words, or even how many words are present.
All of this means that a key part of working with text data is to break down text into words. This process is called tokenizing.
Tokenizing
Tokenizing means breaking text down into smaller pieces. Let’s start by breaking up text into individual words.
Question 4
If we tokenize the 112th title in our data set, what would the result look like?
Note: I don’t want you to code anything yet! Just show me what it would look like to tokenize the title by hand.
This would be very tedious if we had to do it by hand for all 2000 titles! Luckily, tokenizing a title into individual words can be done using a short piece of code:
Some of us may know this code structure, but for many of us, we are
looking at that code going “WHAT?”. What on earth is the
|> thing??
Well, the symbol |> you see in the code above is
called a pipe. We will be using it a
lot in this course. Basically, you can translate this symbol in
your head into the phrase “and then”.
This means that the code structure you see above means this:
In this case, the code takes the 112th title in the data set and tokenizes it into individual words.
# Start with the 112th row in the data set
headlines[112, ] |> # AND THEN
  # Break the title down into words
  unnest_tokens(word, title)
Question 5
Write and run the code needed to tokenize the 14th title in the data set.
The unnest_tokens function takes two
inputs.
- The second input (title) tells us the column in the data set we want to tokenize. In other words, this tells R where to look in the data set to find the text we want to tokenize.
- The first input (word) is the name we want to give to the column where the tokenized text will show up in our output. This means you can use Word or words or something similar if you like.
Question 6
Adapt your code from Question 5 so the name of the 3rd column is
tokens.
You will notice that the code in Question 6 prints out 3 columns. The
first two columns (clickbait and ids) are the
same for every row in the output, because all the words come from just
one title. Basically, the unnest_tokens code takes a single
row of data and then creates multiple rows out of it, but the only thing
that changes is that the single piece of text is now multiple rows, each
containing one word of the original text.
This can be fine, but we don’t really care about the
clickbait or ids columns right now. We only
want to see the words. We can get only the words by adding one more
command to our code:
# Start with the 112th row in the data set
headlines[112, ] |> # AND THEN
  # Break the title down into words
  unnest_tokens(word, title) |> # AND THEN
  # Print out only the column called word
  select(word)
And right here, this is the reason why we use pipes
(|>). We can keep adding to our commands - do this, and
then do this, and then do this. This allows us to perform more
complicated coding without having a lot of messy lines of code.
The select command in R tells R to choose only certain
columns, and ignore the rest.
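To see the "one row in, many rows out" behavior and the effect of select on something small, here is a sketch on a single made-up title. The data frame toy_headlines and the names toy_tokens and toy_words are hypothetical; the chunk assumes tidytext and dplyr are installed.

```r
library(tidytext)
library(dplyr)

# Hypothetical one-row data set with the same columns as headlines
toy_headlines <- data.frame(
  title = "Ten Cats Who Love Mondays",
  clickbait = TRUE,
  ids = 1,
  stringsAsFactors = FALSE
)

# One row goes in, one row per word comes out
toy_tokens <- toy_headlines[1, ] |>
  unnest_tokens(word, title)

# Keep only the word column and drop clickbait and ids
toy_words <- toy_tokens |>
  select(word)

nrow(toy_tokens)    # how many rows the single title turned into
names(toy_tokens)   # columns before select
names(toy_words)    # columns after select
```

Comparing names(toy_tokens) with names(toy_words) shows exactly what select removed.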
Question 7
Run the code needed to tokenize the 14th title (headline) in the data
set (calling the 3rd column word) but only
print out the word column.
Looking at Tokenized Output
Okay, so now we know how to tokenize in R.
# Start with the 112th row in the data set
headlines[112, ] |> # AND THEN
  # Break the title down into words
  unnest_tokens(word, title) |> # AND THEN
  # Print out only the column called word
  select(word)
However…let’s look at what we get when we do this.
Question 8
Take a look at the result of tokenizing the 112th title. This command does something to the title besides tokenize it. What is it?
Hint: If you get stuck, look at what the title looked like before you tokenized it!
Question 9
Why might this extra step performed by the unnest_tokens
function be useful when we analyze text data?
Note: There is more than one correct answer to this!
Storing Things in R
Okay, so now we can tokenize text. Great! However, if we want to actually use that data for anything later, it helps to be able to save the results somewhere for safe keeping. We do this by storing the results.
For example, to tokenize the 112th title, we know that we use:
To store the tokenized text rather than printing it out, we add just one more step to the code:
What does this code do? Well, take a look at your upper right hand panel in
R and you will now see a new data set called
tidy_headlines112 that contains the tokenized version of
the 112th title!
Think of this code as:
So, to store things in R, all we need to do is choose the name we
want the results to be stored under. I chose
tidy_headlines112 to be clear what I’m storing: the
tokenized version of the 112th title. We then use the <-
sign to tell R that we want to store the tokenized text under this
name.
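Here is a small sketch of the storing step on made-up data, assuming tidytext is installed. The names toy_headlines and tidy_toy1 are hypothetical; with the lab data, the name on the left of <- would be tidy_headlines112 and the row would be headlines[112, ].

```r
library(tidytext)

# Hypothetical one-row data set with the same columns as headlines
toy_headlines <- data.frame(
  title = "Ten Cats Who Love Mondays",
  clickbait = TRUE,
  ids = 1,
  stringsAsFactors = FALSE
)

# Nothing prints when this runs: the tokenized text is stored
# under the name on the left of <-
tidy_toy1 <- toy_headlines[1, ] |>
  unnest_tokens(word, title)

# Typing the stored name on its own prints what was saved
tidy_toy1
```

After running this, tidy_toy1 should appear in the Environment panel alongside toy_headlines.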
NOTE: We will use the structure
tidy_datasetname whenever we tokenize text in this course.
This makes it easy when you are looking at a data set in R to know exactly
what is in there. If you see tidy_datasetname, you know you
are looking at a tokenized version of datasetname.
Question 10
Suppose we tokenize all the titles in the headlines data
set. What name would we store these tokenized titles under?
As hinted at in the previous question, it turns out we can tokenize more than one title at once! For example, to tokenize the first two titles, we can use this code:
The 1:2 part of the code tells R to grab all rows
starting at 1 and ending at 2 (so rows 1 to 2).
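As a sketch of what happens with more than one row, the chunk below tokenizes rows 1 to 2 of a made-up three-title data set and stores the result. The names toy_headlines and tidy_toy are hypothetical, and the chunk assumes tidytext is installed.

```r
library(tidytext)

# Hypothetical three-row data set with the same columns as headlines
toy_headlines <- data.frame(
  title = c("Ten Cats Who Love Mondays",
            "Senate Passes Budget Bill",
            "You Will Not Believe This"),
  clickbait = c(TRUE, FALSE, TRUE),
  ids = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# 1:2 grabs rows 1 through 2; row 3 is left out
tidy_toy <- toy_headlines[1:2, ] |>
  unnest_tokens(word, title)

# The ids column records which title each word came from
table(tidy_toy$ids)
```

Because every word keeps the ids value of its title, the words from different titles never get mixed up.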
Question 11
Write and run the code to tokenize and store the first 20 titles in the data set. Show your code, but do NOT print all the tokenized titles in your file!!
Recap so far
Let’s pause for a minute, because this seems like a lot of code. It is, and it will feel that way in the beginning as we get used to it! However, we will use the same coding structure over and over again in this course, so it will start to feel more familiar soon! This is also why we have a lot of labs early in the semester so that we can practice. If you ever feel stuck, let me know!!
So far, we have seen the following code:
To Tokenize Text
To Store Tokenized Text
To Print a row in a dataset
Feature Engineering
Now that we know how to load text data, and tokenize it, let’s create our first feature! In class, we decided that the word “you” was pretty important for determining whether something is clickbait, but that was only with our small sample of 20 titles. Would the same thing be true now that we have 2000 titles?
To find out, let’s create a feature that tells us whether or not the word “you” is in each title. This process of creating our own features is called feature engineering.
To create our feature, we are going to use a function called
grepl. This is a very useful function that detects (finds!)
words in a piece of text. However…neither the name of the function nor
the structure of it is very intuitive, so bear with me a minute!
The structure of the code is:
This function looks at text and checks to see if a certain word is in
the text. If it is, the function returns TRUE. If not, it
returns FALSE. This means that for checking to see if a
certain title contains the word “you”, we use the code below,
where you need to replace the word row with the
number of the title you want to check!
The \\b \\b part of the code looks pretty weird, right?
We are not looking for bs in the text! Instead, the \\b \\b
notation tells R to look for the exact word “you” in
text. If you do not include this, grepl counts “you”,
“your”, “yours”, “you’re”, and anything with “you” in it as being “you”.
It can be useful to look at word stems in this way, but
for today, we want to look at the word “you” specifically.
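To make the effect of \\b concrete, here is a sketch that runs grepl with and without it on a few made-up titles. The vector titles and the names loose and exact are hypothetical; grepl itself is part of base R, so no packages are needed.

```r
# Hypothetical example titles (already lower case)
titles <- c("did you see this",
            "is your dog a genius",
            "you're going to love this",
            "weather update for friday")

# Without \\b: finds the letters y-o-u anywhere, even inside "your"
loose <- grepl("you", titles)

# With \\b on both sides: "you" must start and end at a word edge
exact <- grepl("\\byou\\b", titles)

loose   # TRUE TRUE TRUE FALSE
exact   # TRUE FALSE TRUE FALSE
```

One thing worth checking on your own data: an apostrophe counts as a word edge, so in this sketch "you're" still matches the \\b version, while "your" does not.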
Question 12
Adapt the code above to see if the 1978th title contains the word
you. State your answer!
We can do this for one row at a time, but ideally we want to create a column in our data set that shows whether or not each title contains the word “you”. To start off with, let’s create a column in our data set to hold our indicator for whether or not the title contains “you”.
We already know that <- stores stuff in R. In this
code, we are creating a blank column called you in our
headlines data set to store our feature which
indicates whether or not each title contains the word
“you”. In R, if you mention a column that is not in your data set
already, R creates it! At the moment, we are just filling the feature
with empty space, so we use NA as a placeholder until we
add the real feature.
Now is a good time to open your data set and make sure you have 4 columns, with the last one blank!
Once we have the column in place, we can fill in the feature
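The two steps described above (make a placeholder column, then fill it in) can be sketched on made-up data like this. The data frame toy_headlines is hypothetical; with the lab data, the same steps would use headlines in place of toy_headlines.

```r
# Hypothetical three-row data set with the same columns as headlines
toy_headlines <- data.frame(
  title = c("did you see this",
            "is your dog a genius",
            "weather update for friday"),
  clickbait = c(TRUE, TRUE, FALSE),
  ids = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Step 1: create a new column called you, filled with NA as a placeholder
toy_headlines$you <- NA

# Step 2: replace the placeholder with one TRUE/FALSE per title
toy_headlines$you <- grepl("\\byou\\b", toy_headlines$title)

toy_headlines
```

Printing the data set after each step is a quick way to confirm that the fourth column exists and then gets filled.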
Question 13
There is a really small code difference between the code you used in
Question 12 and the code to the right of <- above. What
is the difference?
This is what allows us to apply the code to every row instead of one row at a time!
And that’s it! We have our first feature!! We have created a new column in our data set that indicates whether or not each title contains the word “you”.
EDA
Whenever we create a feature, it is very important to explore that feature. This allows us to (1) make sure we created the feature correctly and (2) start to use this feature for our analysis. The feature we have created is a categorical feature, so a good way to explore it is by using a table.
We tell R we want to use the headlines data set and the
variable you by using the code headlines$you
(dataset$variable). To create a table, we use the
table(whatWeWantToMakeATableWith) command to actually make
the table.
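As a small sketch of the table command, the chunk below counts the TRUE and FALSE values in a made-up feature. The data frame toy and the name you_counts are hypothetical; with the lab data, the input to table would be headlines$you.

```r
# Hypothetical data set with a TRUE/FALSE feature called you
toy <- data.frame(you = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# dataset$variable picks out one column; table counts each value in it
you_counts <- table(toy$you)
you_counts
```

The counts in the table should always add up to the number of rows in the data set, which is a handy check that nothing went missing.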
Question 14
How many titles contain the word “you”?
If your answer to Question 14 surprises you…it should. It turns out that we missed a step!!
R is what we call a case-sensitive coding language. This means that “you” and “You” are two completely different words as far as R is concerned. This means that it is really important to convert all of our words to lower case before we start working with text. We can do things like count capital letters if we need to, but for most of what we will do, lower case will be the way to go!
We saw that when we tokenized text, converting all the words to lower
case came for free. For grepl, this is not the case, but we
can add it!
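Here is a sketch of the case problem and one way to handle it on made-up titles, using base R only. The vector titles and the names before and after are hypothetical.

```r
# Hypothetical titles with mixed capitalization
titles <- c("You Will Not Believe This",
            "did you see this",
            "Weather Update For Friday")

# Searching the titles as they are: "You" with a capital Y is missed
before <- grepl("\\byou\\b", titles)

# Converting the titles to lower case first, and then searching
after <- grepl("\\byou\\b", tolower(titles))

before   # FALSE TRUE FALSE
after    # TRUE TRUE FALSE
```

The only change between the two searches is what gets handed to grepl as the text; the pattern stays the same.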
Question 15
Based on the code above, what command do you think makes text lower case in R?
Question 16
After running the new code that converts everything to lower case before looking for “you”, update your table from Question 14 (just copy and paste the code and run it again!). How many titles contain the word “you” now?
Okay great! Now let’s make our table a little more professional.
The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit your document. See how nicely the table gets formatted?
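As one possible sketch of a formatted table (not necessarily the exact code used in this lab), the chunk below wraps a table in knitr::kable and sets the column headings with col.names. The vector toy_you and the headings "Value" and "Count" are hypothetical, and the chunk assumes the knitr package is installed.

```r
# Hypothetical TRUE/FALSE feature
toy_you <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Turn the table into a two-column data frame: the value and its count
counts <- as.data.frame(table(toy_you))

# kable formats the table; col.names sets the headings, in column order
formatted <- knitr::kable(counts, col.names = c("Value", "Count"))
formatted
```

The headings in col.names are matched to the columns from left to right, so the number of headings has to equal the number of columns.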
Question 17
Adapt the code above to call the first column “Contains You” and the second column “Titles”. Show your table!
You and Clickbait
The entire point of creating this feature was to try and see if looking for the word “you” in a title was a good way to determine whether or not a title is clickbait. Now that we have the feature, let’s explore this.
There are a few different ways we could visualize the data, but let’s start with a two-way table. This is just a table to visualize the relationship between two categorical variables. For our purposes, these two variables are whether or not the title contains “you” and whether or not the article is clickbait.
To make the table, we can use:
The only issue with this is that the labels are a little confusing. What do true and false mean in this setting?? To add labels, we can use the following:
# Create the table
holder <- table( headlines$you, headlines$clickbait)
# Add row names
rownames(holder) <- c("You: No", "You: Yes")
# Add column names
colnames(holder) <- c("Not Clickbait", "Clickbait")
# Format the table
knitr::kable( holder )
Question 18
Based on the table above, does it look like whether or not a title contains the word “you” is a helpful feature in these data for determining if a title is clickbait? Explain your reasoning.
Question 19
Now, let’s have a little fun!
(a) Choose any word that you want (besides “you”!!) that you think might be a useful feature for detecting clickbait. Explain briefly why you chose it.
(b) Build a feature that indicates whether or not each title contains that word. Show a formatted two-way table of your feature vs. clickbait.
(c) Based on the table in (b), does it look like whether or not a title contains the word you chose is a helpful feature in these data for determining if a title is clickbait? Explain your reasoning.
This is where text analysis really gets into the creative side of working with data. It’s our job to think critically and creatively and decide what features might help us as we analyze text! We will get into a lot more of this, as well as how we can use these features for prediction, soon!!
Before you submit
Three last steps before we knit, and then you will be done with Lab 1!
- Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
- If you are working with a partner, make sure both of your names are at the top of the file.
- You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.
Once you’ve done this, knit your file. This will create the PDF or HTML file you need to submit. If you get stuck, let Dr. Dalzell know!
References
The data set used in this lab is a subset of the
sample_headlines data set downloaded from
https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Data Citation: Horton, Nicholas J. Text Classification
Examples. Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.