STA 175 DataFest Activity 7
The Goal
For our activity today, we are going to work with a brand new kind of data - text data! Text data means data that comes in the form of words or sentences. This could be a review posted online, a book, a journal article, or anything else that involves words. Our goal for today will be to start exploring how we work with this sort of data. We will continue this work next week!
The Data
One of the forms of text data that is analyzed frequently is online reviews. It is common for customers to be asked to provide open-ended feedback on a product or service, meaning that companies receive this feedback in the form of text data. For today, we are looking at reviews provided about airlines.
The data we will be working with today has 14,640 rows. Each row in the data set represents a Tweet that someone posted about an airline. To download the data, copy and paste the following into a browser window: https://drive.google.com/file/d/1O1yBrvxwuGIYSdplRhYeDi-dB4jQugI-/view?usp=share_link
This data set has 15 columns. For the moment, we are interested in the column called text, which contains the actual text of the Tweet. Warning: These are real Tweets, which means they may contain objectionable content and language.
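Before going further, read the data into R. The code in this activity assumes the data is stored in an object called Tweets. Here is a minimal sketch, assuming you saved the downloaded file as Tweets.csv in your working directory (the file name is a placeholder; adjust it to match your download):

# Load the data; "Tweets.csv" is a placeholder file name
Tweets <- read.csv("Tweets.csv")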
Exploring the data: Tokenizing
Now that we have the data set loaded, let’s start actually looking at some text data.
Question 1
What is the text of the Tweet shown in the 15th row in the data set?
Okay, so this is a short sentence. Now what? Well, the power of text data lies in looking at what are called tokens. Generally, a token is a word, though there are some cases where we use short phrases as tokens. We know that Tweets are composed of written text, and that means that to understand the content of the text we need to look at the individual words that make up the Tweet.
There are two packages we need in order to tokenize our text data:
# Load dplyr
library(dplyr)
# Load tidytext
library(tidytext)
The first package is familiar to us, but the second (tidytext) is specifically designed to help us work with text data. As the name of the package suggests, this is a tidy package, which means it works with the tidyverse in R. Because of this, before we can work with the text, we need to put it in the form of a tibble (a tidy table). This is the format that the package is expecting. To make the conversion, use the following:
# Pull the text of the 15th Tweet
textdata <- Tweets$text[15]

# Convert to the form we need for tidytext
textdata <- tibble(textdata)
Once the data is in the correct form, we are ready to tokenize the text of the 15th Tweet!
# Break the text into tokens
textdata %>%
  unnest_tokens(word, textdata)
What appears on your screen are two words. These are the two unique words in the text. What happens if the text has more than two words?
Question 2
Tokenize the text of the Tweet in the 13th row of the data set. How many unique words does this text contain?
Aside from converting the text into tokens, two other changes are made to the text data when we use the unnest_tokens() command.
Question 3
What are these two other changes? Hint: Look at the original text and then look at the output from tokenizing.
Okay, so we can break our text into words. Great!! However, are all of these words really useful?
Stop Words
Question 4
Take a look at the result of tokenizing Tweet 13. What words here do not provide any information about the content of the review? Think about words that don’t really tell you anything about why the individual is writing this comment.
Words that convey no useful information are called stop words. These differ from language to language, but in English they include words like “a”, “the”, “it”, and “is”. The tidytext package contains a whole list of these. To load them (you need to do this!), run the command below:
# Load the list of English stop words
data(stop_words)
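If you are curious, you can peek at this list. The stop_words object is a tibble with a word column and a lexicon column recording which published list each word came from:

# Look at the first few stop words
head(stop_words)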
These words are necessary for writing, but they don’t tell us much about how the writer felt about the airline or the flight. So, let’s remove them!
Luckily, removing stop words can be done using one function:
# Pull the text of the 13th Tweet
textdata <- Tweets$text[13]

# Convert to tidy form
textdata <- tibble(textdata)

textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(stop_words)
Question 5
After removing the stop words, how many unique words do we have in the Tweet now?
Let’s practice!
Question 6
Look at Tweet 27 (row 27). Tokenize the comment and remove the stop words. Based on what you see, what do you think motivated this individual to write a comment?
Question 7
Stop words are not the only challenge in interpreting text data. Take a look at the result of Question 6. Based on what you see, comment on at least two other potential issues we need to figure out how to solve before analyzing text data.
Okay, so we can tokenize and remove stop words in individual comments. Now what?
Looking at all the words
Instead of looking Tweet by Tweet, let’s take a look at what types of words we are seeing across the whole data set. To do this, we need to tokenize the entire column text in the data set.
# Pull the whole column
textdata <- Tweets$text

# Convert to tibble
textdata <- tibble(textdata)

# Store all the words
words_only <- textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(stop_words)
Running this code creates the object words_only. Let’s take a look at this object.
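One quick way to do this is to check how many rows the object has. Since each row of words_only holds one word, the row count is the total number of words (a small sketch, assuming words_only was created as above):

# Total number of words (one word per row)
nrow(words_only)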
Question 8
How many words in total are there in the data once the stop words are removed?
That’s a lot of words!!! Are they all unique? Or do certain words come up more than once? Let’s see.
unique_words <- words_only %>%
  count(word, sort = TRUE)
Running the code above creates a data frame with two columns.
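Because we used sort = TRUE, the most frequent words appear first. So looking at the first few rows shows the top of the list:

# Show the six most frequent words
head(unique_words)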
Question 9
What do you think the column n in the unique_words data frame represents?
Question 10
What are the top 6 most frequent words in our text data? Explain why it makes sense that these words occur often across our data set.
Visualizing the words
Instead of answering Question 10 by looking at the data set, we can also create a visualization. What kind of visualization can we make using words??
library(ggplot2)
# Start with the unique words data frame
unique_words %>%
  # Choose only words that occur more than 600 times
  filter(n > 600) %>%
  # Order the words from most frequent to least frequent
  mutate(word = reorder(word, n)) %>%
  # Create a plot
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)
Hey, it’s a bar graph! This is a familiar type of visualization, just applied to new data.
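If you need a reminder of where these options go in ggplot2, here is a sketch. The cutoff, color, labels, and title below are placeholders for you to replace with your own choices:

unique_words %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  # fill controls the color of the bars
  geom_col(fill = "steelblue") +
  # labs() sets the axis labels and the title
  labs(x = "Number of occurrences", y = NULL, title = "My title")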
Question 11
Change the plot to include only words that occur at least 1000 times. Add a label to the x axis, title the plot, and change the color of the bars to a different color.
Hmm…okay, these words are the most frequent, but are they really useful for telling us what kinds of comments are being made about flights?? Not so much.
Removing additional words
If we decide that there are some words we do not want to consider, we can exclude them along with our stop words. They are not really stop words, but for our analysis they are words we want to set aside.
Recall that this is the code that we used to remove stop words:
# Store all the words
words_only <- textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(stop_words)
The part of this code that removes the words that we don’t want to include is anti_join. Let’s say that, in addition to the stop words, I want to exclude the word “united” from the data set.
# Store all the words
words_only1 <- textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(stop_words) %>%
  filter(word != "united")
Question 12
Re-create your graph from Question 11 using the new
words_only1
list of words. Hint: You will need to create
unique_words1
before you can create your plot.
Question 13
What other words might you want to exclude?
If we have a longer list of words that we want to exclude, we can clean up the code a little bit to make the filtering easier.
# Create the list of words to exclude
to_exclude <- c("united", "flight")

# Store all the words
words_only1 <- textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(stop_words) %>%
  filter(!(word %in% to_exclude))
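Another option is to literally add the extra words to the stop_words tibble and let anti_join() remove everything at once. Here is a sketch, reusing the to_exclude vector; the “custom” lexicon label is an arbitrary placeholder:

# Append our extra words to the stop word list
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = to_exclude, lexicon = "custom")
)

words_only1 <- textdata %>%
  unnest_tokens(word, textdata) %>%
  anti_join(my_stop_words)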
Question 14
Continue this process until you create a plot with 10 words that you consider informative. In other words, you want 10 words that tell you something about why the individual was writing the Tweet.
Question 15
Based on what you see, what seem to be the common topics that people are writing about?
At this point, there are two big things we can do with text data: (1) topic modeling and (2) sentiment analysis. Topic modeling groups together words that describe a similar reason for writing, such as complaining about a late flight. Sentiment analysis helps us determine whether a Tweet contains mostly positive or mostly negative content, and which words are associated with these positive or negative messages.
We will explore sentiment analysis next week!!
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 February 21.
Citation
The data used in this lab came from:
Figure Eight. Twitter US Airline Sentiment. 2020. Kaggle. Data Set [.csv], accessed February 20, 2023, from https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment.