Text Analysis Exercises

This exercise provides an introduction to working with unstructured text data. We will apply basic text analysis techniques to Facebook’s and Twitter’s 2017 annual reports (10-Ks).

General housekeeping items

Let’s begin by loading the required libraries and clearing the environment:

library(tidyverse)
library(tidytext)
library(textdata)
library(wordcloud)
library(tm)
rm(list = ls())

Set your working directory:

setwd('C:/YOURWD')

Text analysis on Facebook and Twitter 10-Ks
On the course site, you will find two text files containing raw text for Facebook and Twitter’s 2017 annual reports (10-Ks). These filings have been cleaned and parsed and are available from Bill McDonald’s website (check it out for more filings and text data!). Save each file into your working directory and import as follows:
facebook <- read_file('fb_2017_k.txt')
twitter <- read_file('tw_2017_k.txt')

Next, let’s load the stop words and sentiment dictionaries into our environment:
loughran <- get_sentiments('loughran')
stop <- stop_words

Take some time to peruse the dictionaries above. As we discussed in class, these dictionaries, while useful, are not perfect and may not work in every case.
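If you’d like a quick numeric overview rather than scrolling through the tables, here is a small sketch using only the two objects defined above:

# Distribution of words across the Loughran-McDonald sentiment categories
loughran %>%
  count(sentiment, sort = TRUE)

# The stop-word list combines several lexicons
stop %>%
  count(lexicon)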
Let’s wrangle the text files into a data frame (tibble) and tokenize all words:
reports <- tibble(company = c('facebook', 'twitter'),
                  fiscal_year = 2017,
                  text = c(facebook, twitter))
tidy_reports <- reports %>%
  unnest_tokens(word, text, token = 'words')

Note that the tokenizer removes punctuation and converts all characters to lowercase.
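To see this behavior on a toy example (an illustration only, not part of the main pipeline):

# Punctuation is stripped, text is lowercased, and each word becomes a row
tibble(text = 'Hello, World! Risk Factors: 10-K.') %>%
  unnest_tokens(word, text)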
Let’s go one step further and remove any values that contain numbers:

tidy_reports <- tidy_reports %>%
  filter(!str_detect(word, '[0-9]+'))
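If the regular expression is unfamiliar, here is what it flags (illustration only):

# TRUE marks tokens containing at least one digit; these get dropped
str_detect(c('revenue', '10k', '2017'), '[0-9]+')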
Now let’s remove stop words:

tidy_reports <- tidy_reports %>%
  anti_join(stop, by = 'word')
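To see how much text survives the cleaning steps, a quick optional check:

# Total tokens remaining per filing after dropping numbers and stop words
tidy_reports %>%
  count(company)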
Next, let’s generate word counts and plot them (Facebook and Twitter separately):

tidy_counts <- tidy_reports %>%
  count(company, word) %>%
  arrange(company, desc(n))
tidy_counts %>%
  filter(company == 'facebook') %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = 'Facebook 10-K Word Counts', x = 'Word', y = 'Count')
tidy_counts %>%
  filter(company == 'twitter') %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = 'Twitter 10-K Word Counts', x = 'Word', y = 'Count')
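The two plots above differ only in the company filter. As an optional alternative, both companies can share one faceted figure; this sketch uses tidytext’s reorder_within() and scale_x_reordered() helpers, which keep the bars sorted within each panel (the same trick applies to the faceted sentiment plot below):

tidy_counts %>%
  group_by(company) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, company)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ company, scales = 'free') +
  labs(title = '10-K Word Counts', x = 'Word', y = 'Count')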
Now, let’s join the data with the Loughran-McDonald sentiment dictionary and generate word counts by word and sentiment:

loughran_counts <- tidy_reports %>%
  inner_join(loughran, by = 'word') %>%
  count(company, sentiment, word, sort = TRUE) %>%
  arrange(company, sentiment, desc(n))
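As an optional aside (not part of the original exercise), word counts by sentiment also support a simple one-number summary per filing; net_tone below is a hypothetical name introduced for illustration:

# Hypothetical summary: net tone = (positive - negative) / (positive + negative)
loughran_counts %>%
  filter(sentiment %in% c('positive', 'negative')) %>%
  group_by(company, sentiment) %>%
  summarise(total = sum(n), .groups = 'drop') %>%
  pivot_wider(names_from = sentiment, values_from = total) %>%
  mutate(net_tone = (positive - negative) / (positive + negative))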
Now that we have word counts by sentiment, let’s plot:

sentiment_counts <- loughran_counts %>%
  filter(sentiment %in% c('positive', 'negative')) %>%
  group_by(company, sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = fct_reorder(word, n))

ggplot(sentiment_counts, aes(x = word, y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(company ~ sentiment, scales = 'free') +
  labs(title = '10-K Sentiment Counts', x = 'Word', y = 'Count')

Illustration of reshaping the data into a document-term matrix (two ways):
dtm1 <- tidy_counts %>%
  cast_dtm(company, word, n) %>%
  as.matrix() %>%
  as_tibble()  # as_tibble() (rather than tibble()) yields one column per word

dtm2 <- tidy_counts %>%
  rename(company_name = company) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)  # absent words become 0, not NA
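A quick sanity check on the two reshapes: note that as_tibble() silently drops the document names stored as matrix rownames, while dtm2 keeps them in company_name.

dim(dtm1)  # one row per company, one column per distinct word
dim(dtm2)  # same rows and words, plus the company_name identifier column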
Finally, here is a demonstration of a word cloud of positive words (Facebook and Twitter separately); a sketch contrasting positive and negative words in a single cloud follows at the end.

set.seed(42)
wordcloud_data <- loughran_counts %>%
  filter(company == 'facebook', sentiment == 'positive')
wordcloud(
  words = wordcloud_data$word,
  freq = wordcloud_data$n,
  max.words = 30,
  random.order = FALSE,
  colors = brewer.pal(8, 'Dark2')
)

wordcloud_data <- loughran_counts %>%
  filter(company == 'twitter', sentiment == 'positive')
wordcloud(
  words = wordcloud_data$word,
  freq = wordcloud_data$n,
  max.words = 30,
  random.order = FALSE,
  colors = brewer.pal(8, 'Dark2')
)
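As promised above, positive and negative words can also be contrasted in one figure with comparison.cloud() from the wordcloud package (already loaded). This is an optional sketch, not part of the original exercise; fb_sentiment_matrix is a name introduced here for illustration:

# Hypothetical extension: term matrix with one row per word, one column per sentiment
fb_sentiment_matrix <- loughran_counts %>%
  filter(company == 'facebook', sentiment %in% c('positive', 'negative')) %>%
  select(word, sentiment, n) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  select(word, negative, positive) %>%  # pin the column order
  column_to_rownames('word') %>%
  as.matrix()

# Colors map to the matrix columns: negative, then positive
comparison.cloud(fb_sentiment_matrix,
                 max.words = 30,
                 colors = c('firebrick', 'forestgreen'))

For more practice and examples, check out Julia Silge and David Robinson’s Text Mining with R, their book-length introduction to tidytext.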