STA 279 Lab 1

Complete all Questions.

The Goal

All of us used R in STA 112, but the code we need for text will be a little different. Today, we are going to warm up by reminding ourselves about coding in R, and by getting started with the code that will help us work with text data in R.

A lot of the work in using text data comes in being able to manipulate what you have (a string of words and phrases) into what you need (usable content for drawing conclusions and building models). Today, we are going to start to explore the coding structure we will use to do this.

Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.

Getting Started

Clearing the Workspace

You learned R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.

To clear all of the data in the upper right hand panel, look at the top of the panel and find an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.

Installing Packages

Once we have opened up an RMarkdown file, the first thing we need to do is load the packages we need to run our analysis. A package is a collection of related functions. When you load a package into R, you give R access to all of the functions within the package. Before we can load a package, though, it has to be installed, so our first step is to remind ourselves how to install an R package.

To install a package:

    1. Look at the top of your RStudio screen and find “Tools”.
    2. From the drop down menu, choose “Install Packages”.
    3. In the white box, type dplyr, and then hit install.

We actually need a few packages for today: tidytext, tidyr, dplyr, and ggplot2. We will need these packages for essentially everything we do in this course. Luckily you only have to install each package once!!!

Go ahead and install all 4 packages.
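If you prefer typing to clicking, the same installation can be done with code. The sketch below is one possible approach, not a required step: it checks which of the four packages are missing and leaves the actual install line commented out so nothing is downloaded by accident. The names needed and missing_pkgs are made up for this illustration.

```r
# The four packages used in this lab
needed <- c("tidytext", "tidyr", "dplyr", "ggplot2")

# Which of them are not installed yet? (character(0) means none are missing)
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
missing_pkgs

# Uncomment the next line and run it once, in the Console, to install them
# install.packages(missing_pkgs)
```

It is usually best to run installation commands in the Console rather than in a chunk, so that your file does not try to reinstall packages every time you knit.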

Loading Packages

Now that we have the packages installed, we need to tell R to actually use them. To do that, we need to create a space in our RMarkdown file for code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).

To tell RMarkdown we are about to give it code, there are two ways to do it.

  • Option 1: Look at the top of your Markdown file, and find Code. Click it. From the drop-down menu, choose Insert Chunk. Click it! A gray box should appear in your Markdown file. This is a chunk.
  • Option 2: In the top gray bar of your Markdown file, look for a small green C (shown below). Click that, and choose R! A gray box should appear in your Markdown file. This is a chunk.

When you are done, you should have a code chunk that looks like this:

Anything we put inside a chunk (that gray box) will be treated as computer code. Go ahead and put the code below inside a chunk.

suppressMessages( library(tidytext) ) 
suppressMessages( library(tidyr) )
suppressMessages( library(dplyr) )
suppressMessages( library(ggplot2) ) 

This means you should have something like this:

Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play). This tells R to run the code!

When you run this code, it will look like nothing happens…and that’s because all we told R to do was get ready to use these packages. With such a command, nothing will show up on our screen, and that’s okay!

Note: In the code above, you’ll notice I have suppressMessages around each library. This is not strictly necessary, but if you don’t include it, R prints out messages / comments about the libraries when you load them in, for example: Attaching package: dplyr. This is annoying when we want to have professionally formatted files, so suppressMessages just tells R we don’t want to see those.

Loading the Data

Now that we have the libraries loaded, we need the data. In our lab today, we are going to continue working with article titles as we did in the first class. We will work with \(n = 2000\) titles, and to load the data, copy and paste the following into a chunk in R and press play:

headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")

The code above will load a data set called headlines with \(n = 2000\) rows and 3 columns into R. The columns are:

  • title: the title (headline) of the article
  • clickbait: a human generated indicator for whether or not an article is clickbait; FALSE means it is not clickbait and TRUE means it is clickbait.
  • ids: a number assigned to each article; think of this like an article identifier.
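One way to check that a data set has the shape you expect is with dim, names, and head. The sketch below uses a tiny made-up data frame called toy_headlines (three invented titles, not rows from the real data) so that it runs without a download. The same three commands can be run on headlines.

```r
# A made-up stand-in with the same three columns as headlines
toy_headlines <- data.frame(
  title = c("Ten Snacks You Need To Try",
            "Council Approves New Budget",
            "Which Puppy Are You"),
  clickbait = c(TRUE, FALSE, TRUE),
  ids = 1:3
)

dim(toy_headlines)      # number of rows, then number of columns
names(toy_headlines)    # the column names
head(toy_headlines, 2)  # the first two rows
```

For the real data, dim(headlines) should report 2000 rows and 3 columns if the load worked as described above.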

Now that we have data, we are ready to use it to start answering our lab questions! We will use Markdown (or Quarto if you prefer) files in our course to submit labs. When you are answering a lab question, your set up should look like this:

The ## allows your Question numbers to show up in bold so I can easily find them for grading. The * puts your text in italics. Let’s try it!

Question 1

The variable clickbait is an indicator variable. Remind me - what is an indicator variable? Hint: This can also be called a dummy variable.

We are working with article titles today primarily because they are short! It means we can look at each word in the text and make sure our code is doing what we think it does. However, we will work with all kinds of data sets in this course! We will analyze articles, books, Amazon reviews, song lyrics, and more.

All of this is to say we are starting with clickbait and article titles, but this won’t be the whole course!

We are going to start to explore text data by focusing on only the first row in the data set. To print out the first row so we can see it, run the following code:

headlines[ 1 , ]

In this code, headlines is the name of the data set and the [1, ] part of the code tells R to print the first row. This means that to print a row in R, we use the format:

dataset[ row we want, ]
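The comma in this format matters: whatever comes before the comma picks rows, and whatever comes after it picks columns. Leaving the column spot empty keeps every column. Here is a small sketch on a made-up data frame called toy (the titles are placeholders, not real headlines).

```r
# A made-up data frame with three rows
toy <- data.frame(
  title = c("Title A", "Title B", "Title C"),
  ids = 1:3
)

# Row 2, every column (nothing after the comma)
second_row <- toy[2, ]
second_row

# Row 2, only the title column
one_cell <- toy[2, "title"]
one_cell
```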

Question 2

Run the code needed to print out the 3rd row in the headlines data set.

Let’s look at the 112th title.

headlines[ 112 , ]

Question 3

Based only on the title, do you think the article (112) is clickbait or not? Clearly justify your reasoning in 1-2 sentences.

Note: There is no one right answer here, I just want to see how you thought this through!

Regardless of how you answered Question 3, you probably mentioned something that had to do with the content, meaning the words, in the title. We could be interested in looking for specific words, or certain types of words, or even how many words are present.

All of this means that a key part of working with text data is to break down text into words. This process is called tokenizing.

Tokenizing

Tokenizing means breaking text down into smaller pieces. Let’s start by breaking up text into individual words.

headlines[ 112 , ]

Question 4

If we tokenize the 112th title in our data set, what would the result look like?

Note: I don’t want you to code anything yet! Just show me what it would look like to tokenize the title by hand.

This would be very tedious if we had to do it by hand for all 2000 titles! Luckily, tokenizing a title into individual words can be done using a short piece of code:

headlines[112,] |>
  unnest_tokens(word, title)

Some of us may know this code structure, but for many of us, we are looking at that code going “WHAT?”. What on earth is the |> thing??

Well, the symbol |> you see in the code above is called a pipe. We will be using it a lot in this course. Basically, you can translate this symbol in your head into the phrase “and then”.

This means that the code structure you see above means this:

Start with this data |> AND THEN
  do this with it

In this case, the code takes the 112th title in the data set and tokenizes it into individual words.

# Start with the 112th row in the data set 
headlines[112, ] |> # AND THEN
  # Break the title down into words
  unnest_tokens(word,title)
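If the “and then” idea still feels abstract, it can help to see the pipe with plain numbers. In the sketch below (the names values, nested, and piped are made up for this example), both versions do exactly the same calculation. The piped version just reads top to bottom instead of inside out.

```r
values <- c(1, 4, 9)

# Nested version: read from the inside out
nested <- sum(sqrt(values))

# Piped version: start with values, AND THEN take square roots, AND THEN add them up
piped <- values |>
  sqrt() |>
  sum()

nested
piped
```

Note that the |> symbol is built into R starting with version 4.1, so this assumes a reasonably recent installation.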

Question 5

Write and run the code needed to tokenize the 14th title in the data set.

The unnest_tokens function takes two inputs.

  • The second input (title) tells us the column in the data set we want to tokenize. In other words, this tells R where to look in the data set to find the text we want to tokenize.
  • The first input (word) is the name we want to give to the column where the tokenized text will show up in our output. This means you can use Word or words or something similar if you like.
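To see how the two inputs behave, here is a small sketch on a made-up one-row data frame. The data frame toy, its title, and the column name term are all invented for this illustration. The point is only that the first input can be any name you choose, while the second must match a column that already exists.

```r
suppressMessages( library(tidytext) )

# A made-up data frame with one title
toy <- data.frame(
  title = "Dog Rescues Kitten Today",
  ids = 1
)

# First input: the name for the new column
# Second input: the existing column that holds the text
toy_tokens <- toy |>
  unnest_tokens(term, title)

names(toy_tokens)  # the title column is replaced by a column called term
nrow(toy_tokens)   # one row per word in the title
```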

Question 6

Adapt your code from Question 5 so the name of the 3rd column is tokens.

You will notice that the code in Question 6 prints out 3 columns. The first two columns (clickbait and ids) are the same for every row in the output, because all the words come from just one title. Basically, the unnest_tokens code takes a single row of data and then creates multiple rows out of it, but the only thing that changes is that the single piece of text is now multiple rows, each containing one word of the original text.

This can be fine, but we don’t really care about the clickbait or ids columns right now. We only want to see the words. We can get only the words by adding one more command to our code:

# Start with the 112th row in the data set
headlines[112, ] |> # AND THEN

  # Break the title down into words
  unnest_tokens(word,title) |> #AND THEN
  
  # Print out only the column called word
  select(word)
  

And right here, this is the reason why we use pipes (|>). We can keep adding to our commands - do this, and then do this, and then do this. This allows us to perform more complicated coding without having a lot of messy lines of code.

The select command in R tells R to choose only certain columns, and ignore the rest.
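select can also keep more than one column at a time: just list the names you want, separated by commas. A minimal sketch on a made-up data frame (toy and kept are names invented for this example):

```r
suppressMessages( library(dplyr) )

# A made-up data frame with three columns
toy <- data.frame(
  title = c("Title A", "Title B"),
  clickbait = c(TRUE, FALSE),
  ids = 1:2
)

# Keep only title and ids, and ignore clickbait
kept <- toy |>
  select(title, ids)

names(kept)
```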

Question 7

Run the code needed to tokenize the 14th title (headline) in the data set (calling the 3rd column word) but only print out the word column.

Looking at Tokenized Output

Okay, so now we know how to tokenize in R.

# Start with the 112th row in the data set
headlines[112, ] |> # AND THEN

  # Break the title down into words
  unnest_tokens(word,title) |> #AND THEN
  
  # Print out only the column called word
  select(word)

However…let’s look at what we get when we do this.

Question 8

Take a look at the result of tokenizing the 112th title. This command does something to the title besides tokenize it. What is it?

Hint: If you get stuck, look at what the title looked like before you tokenized it!

Question 9

Why might this extra step performed by the unnest_tokens function be useful when we analyze text data?

Note: There is more than one correct answer to this!

Storing Things in R

Okay, so now we can tokenize text. Great! However, if we want to actually use that data for anything later, it helps to be able to save the results somewhere for safe keeping. We do this by storing the results.

For example, to tokenize the 112th title, we know that we use:

headlines[112,] |>
  unnest_tokens(word, title)

To store the tokenized text rather than printing it out, we add just one more step to the code:

tidy_headlines112 <- headlines[112,] |>
  unnest_tokens(word, title)

What does this code do? Well, take a look at the upper right hand panel in RStudio and you will now see a new data set called tidy_headlines112 that contains the tokenized version of the 112th title!

Think of this code as:

THIS IS WHERE I WANT TO STORE IT <- THIS IS THE STUFF I WANT TO STORE

So, to store things in R, all we need to do is choose the name we want the results to be stored under. I chose tidy_headlines112 to be clear what I’m storing: the tokenized version of the 112th title. We then use the <- sign to tell R that we want to store the tokenized text under this name.

NOTE: We will use the structure tidy_datasetname whenever we tokenize text in this course. This makes it easy when you are looking at a data set in R to know exactly what is in there. If you see tidy_datasetname, you know you are looking at a tokenized version of datasetname.

Question 10

Suppose we tokenize all the titles in the headlines data set. What name would we store these tokenized titles under?

As hinted at in the previous question, it turns out we can tokenize more than one title at once! For example, to tokenize the first two titles, we can use this code:

tidy_headlines1and2 <- headlines[1:2,] |>
  unnest_tokens(word, title)

The 1:2 part of the code tells R to grab all rows starting at 1 and ending at 2 (so rows 1 to 2).
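The colon works the same way for any start and end you choose. Here is a short sketch on a made-up data frame, showing both what the colon produces on its own and how a range of rows can be stored rather than printed. The names toy and toy_2to3 are invented for this example.

```r
# On its own, the colon lists the whole numbers from start to end
3:6

# A made-up data frame with four rows
toy <- data.frame(
  title = c("Title A", "Title B", "Title C", "Title D"),
  ids = 1:4
)

# Store rows 2 through 3 under a new name instead of printing them
toy_2to3 <- toy[2:3, ]

# Check how many rows were stored without printing all of them
nrow(toy_2to3)
```

Using nrow is a handy way to confirm that you stored what you meant to store without filling your file with output.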

Question 11

Write and run the code needed to tokenize and store the first 20 titles in the data set. Show your code, but do NOT print all the tokenized titles in your file!!

Recap so far

Let’s pause for a minute, because this seems like a lot of code. It is, and it will feel that way in the beginning as we get used to it! However, we will use the same coding structure over and over again in this course, so it will start to feel more familiar soon! This is also why we have a lot of labs early in the semester so that we can practice. If you ever feel stuck, let me know!!

So far, we have seen the following code:

To Tokenize Text

dataset |>
  unnest_tokens( word , column )

To Store Tokenized Text

storageName <- dataset |>
  unnest_tokens( word , column )

To Print a row in a dataset

dataset[ row we want, ]

Feature Engineering

Now that we know how to load text data, and tokenize it, let’s create our first feature! In class, we decided that the word “you” was pretty important for determining whether something is clickbait, but that was only with our small sample of 20 titles. Would the same thing be true now that we have 2000 titles?

To find out, let’s create a feature that tells us whether or not the word “you” is in each title. This process of creating our own features is called feature engineering.

To create our feature, we are going to use a function called grepl. This is a very useful function that detects (finds!) words in a piece of text. However…neither the name of the function nor the structure of it is very intuitive, so bear with me a minute!

The structure of the code is:

grepl("\\bword you want to find\\b", text where you are looking for the word)

This function looks at text and checks to see if a certain word is in the text. If it is, the function returns TRUE. If not, it returns FALSE. This means that for checking to see if a certain title contains the word “you”, we use the code below, where you need to replace the word row with the number of the title you want to check!

grepl("\\byou\\b", headlines$title[row])

The \\b \\b part of the code looks pretty weird, right? We are not looking for bs in the text! Instead, the \\b \\b notation tells R to look for the exact word “you” in text. If you do not include this, grepl counts “you”, “your”, “yours”, “you’re”, and anything with “you” in it as being “you”. It can be useful to look at word stems in this way, but for today, we want to look at the word “you” specifically.
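To see the difference the \\b \\b makes, we can try grepl on a few short made-up pieces of text, once without the word edges and once with them. The vector examples and the names loose and exact are invented for this sketch.

```r
# Four made-up pieces of text to search
examples <- c("you", "your", "youth", "you're")

# Without \\b: TRUE for any text containing the letters y-o-u
loose <- grepl("you", examples)

# With \\b: "you" must begin and end at the edge of a word
exact <- grepl("\\byou\\b", examples)

# Line the results up side by side
data.frame(examples, loose, exact)
```

One thing to watch for: R treats an apostrophe as the edge of a word, so "you're" still counts as containing “you” even with \\b \\b in place, while "your" and "youth" do not.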

Question 12

Adapt the code above to see if the 1978th title contains the word you. State your answer!

We can do this for one row at a time, but ideally we want to create a column in our data set that shows whether or not each title contains the word “you”. To start off with, let’s create a column in our data set to hold our indicator for whether or not the title contains “you”.

headlines$you <- NA

We already know that <- stores stuff in R. In this code, we are creating a blank column called you in our headlines data set to store our feature which indicates whether or not each title contains the word “you”. In R, if you mention a column that is not in your data set already, R creates it! At the moment, we are just filling the feature with empty space, so we use NA as a placeholder until we add the real feature.
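Here is the same idea on a tiny made-up data frame, so you can watch the number of columns change. The data frame toy and the column name flag are invented for this sketch.

```r
# A made-up data frame with two columns
toy <- data.frame(
  title = c("Title A", "Title B"),
  ids = 1:2
)
ncol(toy)        # 2 columns to start

# flag does not exist yet, so R creates it and fills every row with NA
toy$flag <- NA
ncol(toy)        # now 3 columns
toy$flag         # every entry is NA until we fill in real values
```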

Now is a good time to open your data set and make sure you have 4 columns, with the last one filled with NA!

Once we have the column in place, we can fill in the feature:

headlines$you <- grepl("\\byou\\b", headlines$title)

Question 13

There is a really small code difference between the code you used in Question 12 and the code to the right of <- above. What is the difference?

This is what allows us to apply the code to every row instead of one row at a time!

And that’s it! We have our first feature!! We have created a new column in our data set that indicates whether or not each title contains the word “you”.

EDA

Whenever we create a feature, it is very important to explore that feature. This allows us to (1) make sure we created the feature correctly and (2) start to use this feature for our analysis. The feature we have created is a categorical feature, so a good way to explore it is by using a table.

We tell R we want to use the headlines data set and the variable you by using the code headlines$you (dataset$variable). To create the table, we use the table(whatWeWantToMakeATableWith) command.

table(headlines$you)
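If you have not seen table on a TRUE/FALSE variable before, here is a small sketch with made-up values (contains_word and counts are names invented for this example). It also shows two ways to pull out just the number of TRUEs.

```r
# Five made-up TRUE/FALSE values
contains_word <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Count how many times each value appears
counts <- table(contains_word)
counts

# Pull out only the count for TRUE
counts["TRUE"]

# R treats TRUE as 1 and FALSE as 0, so adding them up also counts the TRUEs
sum(contains_word)
```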

Question 14

How many titles contain the word “you”?

If your answer to Question 14 surprises you…it should. It turns out that we missed a step!!

R is what we call a case-sensitive coding language. This means that “you” and “You” are two completely different words as far as R is concerned. This means that it is really important to convert all of our words to lower case before we start working with text. We can do things like count capital letters if we need to, but for most of what we will do, lower case will be the way to go!

We saw that when we tokenized text, converting all the words to lower case came for free. For grepl, this is not the case, but we can add it!

headlines$you <- grepl("\\byou\\b", tolower(headlines$title))

Question 15

Based on the code above, what command do you think makes text lower case in R?

Question 16

After running the new code that converts everything to lower case before looking for “you”, update your table from Question 14 (just copy and paste the code and run it again!). How many titles contain the word “you” now?

Okay great! Now let’s make our table a little more professional.

knitr::kable(table(headlines$you), col.names=c("Name 1", "Name 2") )

The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit your document. See how nicely the table gets formatted?

Question 17

Adapt the code above to call the first column “Contains You” and the second column “Titles”. Show your table!

You and Clickbait

The entire point of creating this feature was to try and see if looking for the word “you” in a title was a good way to determine whether or not a title is clickbait. Now that we have the feature, let’s explore this.

There are a few different ways we could visualize the data, but let’s start with a two-way table. This is just a table to visualize the relationship between two categorical variables. For our purposes, these two variables are whether or not the title contains “you” and whether or not the article is clickbait.

To make the table, we can use:

knitr::kable( table( headlines$you, headlines$clickbait) )

The only issue with this is that the labels are a little confusing. What do true and false mean in this setting?? To add labels, we can use the following:

# Create the table 
holder <- table( headlines$you, headlines$clickbait)

# Add row names 
rownames(holder) <- c("You: No", "You: Yes")

# Add column names 
colnames(holder) <- c("Not Clickbait", "Clickbait")

# Format the table 
knitr::kable( holder )
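The rownames and colnames approach relies on R listing FALSE before TRUE, and on both values actually appearing in the data. One alternative, sketched below on made-up vectors, attaches the labels to the values themselves using factor, so the labels cannot end up on the wrong row. It also uses prop.table to turn counts into row proportions, which can make a two-way table easier to compare. This is one option, not the required method for the lab, and the names has_word, is_clickbait, and tab are invented for this example.

```r
# Made-up TRUE/FALSE values for five imaginary titles
has_word     <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
is_clickbait <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Attach a label to each value before building the table
tab <- table(
  factor(has_word, levels = c(FALSE, TRUE),
         labels = c("Word: No", "Word: Yes")),
  factor(is_clickbait, levels = c(FALSE, TRUE),
         labels = c("Not Clickbait", "Clickbait"))
)
tab

# Row proportions: within each row, what fraction falls in each column?
prop.table(tab, 1)
```

The result can be passed to knitr::kable in the same way as holder above.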

Question 18

Based on the table above, does it look like whether or not a title contains the word “you” is a helpful feature in these data for determining if a title is clickbait? Explain your reasoning.

Question 19

Now, let’s have a little fun!

  1. Choose any word that you want (besides “you”!!) that you think might be a useful feature for detecting clickbait. Explain briefly why you chose it.

  2. Build a feature that indicates whether or not each title contains that word. Show a formatted two way table of your feature vs. clickbait.

  3. Based on the table in part 2, does it look like whether or not a title contains the word you chose is a helpful feature in these data for determining if a title is clickbait? Explain your reasoning.

This is where text analysis really gets into the creative side of working with data. It’s our job to think critically and creatively and decide what features might help us as we analyze text! We will get into a lot more of this, as well as how we can use these features for prediction, soon!!

Before you submit

Three last steps before we knit, and then you will be done with Lab 1!

    1. Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
    2. If you are working with a partner, make sure their name and yours are at the top of the file.
    3. You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.

Once you’ve done this, knit your file. This will create the PDF or HTML file you need to submit. If you get stuck, let Dr. Dalzell know!

References

The data set used in this lab is a subset of the sample_headlines data set downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master. Data Citation: Horton, Nicholas J. Text Classification Examples. Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.