Introduction

In this lesson, we will open text data and conduct some basic analyses. This lab will help you cultivate a deeper understanding of text methods we discussed in class, as well as some of the challenges associated with analyzing text data.

Getting started

Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder. Load and/or install packages at the top of your script.

Note: even if you set your working directory through the toolbar in RStudio (e.g. by clicking Session –> Set Working Directory…), you should still paste the executed code into your R script or R Markdown document. This helps make your code replicable and ensures you know which folder on your computer you’re working out of.

Set working directory

setwd("/Users/shanaya/POL3325G/Lectures/Lecture 11")

Packages and data

library(rio)
library(tidyverse)
library(tidytext)
library(stringr)
library(ggplot2)
library(lubridate) # for year(); attached automatically by tidyverse >= 2.0.0
library(tm)
library(topicmodels)
library(widyr)
tr_tweets <- rio::import("trump_tweets.csv")
load("canada_party_platforms.RData")

Exploratory analysis of Trump’s tweets

The example below is based on an example created by Chris Bail from Duke University.

How many words are in this dataset of Donald Trump’s tweets?

# below I show you how to tidy the data such that there is one word per row
tidy_tr_tweets <- tr_tweets %>%
  select(created_at, text) %>%
  unnest_tokens("word", text)  

Above, we tokenize our data such that it is one word/term per row. This is in keeping with the “tidy data” format (each variable is a column, each observation is a row). This is not the only way to store text data, however; researchers often use document-term matrices instead. We will see an example of a DTM below.

# how often does each token appear? 
# unnest_tokens() "tokenizes" our text - split our text into smaller subunits (words) 
word_count <- tr_tweets %>%
  select(created_at, text) %>%
  unnest_tokens(output = "word", input = text) %>%
  count(word) %>% # count how many times each word appears
  arrange(desc(n))

head(word_count)
##   word     n
## 1  the 13373
## 2   to  8113
## 3 t.co  6584
## 4    a  6340
## 5  and  5966
## 6   is  5572

There are 362,237 words in this dataset. When we look at the top words, most of them are uninformative (e.g., “the”, “and”, “to”, “of”). This highlights why we often pre-process our text.

What are the commonly used words in Trump’s tweets?

data("stop_words") # import a stop words dataframe

# in the code below, we take several "pre-processing" steps 
tr_tweets_informative_words <- tidy_tr_tweets %>% 
  anti_join(stop_words, by="word") %>% # remove stop words using pre-determined stop word dictionary
  filter(!str_detect(word, "\\d")) %>% # remove any value in the word column that contains a digit
  count(word) %>%
        arrange(desc(n)) %>% 
  filter(!word %in% c("t.co", "https", "http", "amp")) # remove list of custom "stop words" 

head(tr_tweets_informative_words)
##      word    n
## 1   trump 1631
## 2  people 1067
## 3   obama 1048
## 4    time  777
## 5 america  743
## 6 country  574
tr_tweets_informative_words %>%
  slice_head(n=20) %>%
  mutate(word = reorder(word, -n)) %>% # order bars by frequency rather than alphabetically
  ggplot(aes(x=word, y=n, fill=word)) +
  geom_col() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = "Frequency",
       x = "Word",
       title="Top 20 Words in Trump's Tweets")

Sentiment analysis of Trump’s tweets

Sentiment analysis is a type of dictionary analysis. We will use the “bing” sentiment dictionary included in the tidytext package. This is a simple dictionary with long lists of words associated with “positive” and “negative” sentiment. Some sentiment dictionaries instead assign numeric sentiment scores to words, e.g. “great” = 2 while “good” = 1.
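For example, the AFINN lexicon assigns each word an integer score from -5 (most negative) to +5 (most positive). Here is a quick sketch; note that get_sentiments("afinn") relies on the textdata package and will prompt you to download the lexicon the first time you call it.

get_sentiments("afinn") %>%
  filter(word %in% c("good", "great", "terrible")) # each word gets a numeric score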

head(get_sentiments("bing"))
## # A tibble: 6 × 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative

What are the most common negative words Trump uses?

bing_word_counts <- tidy_tr_tweets %>% 
  inner_join(get_sentiments("bing"), by="word") %>%
  # inner_join: keep only obs. that exist in both x and y (tidy_tr_tweets and sentiment dictionary)
  count(word, sentiment) %>%
  arrange(sentiment, desc(n)) # add desc() around sentiment to sort by positive words instead

We are using the Bing et al. sentiment dictionary (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).

bing_word_counts %>%
  group_by(sentiment) %>% 
  slice_max(n, n=15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill=sentiment)) +
  geom_col() +
  facet_wrap(vars(sentiment), scales="free") +
  labs(x="Word frequency", y=NULL)

“Trump” most likely refers to Donald Trump’s name rather than the verb (or common noun) “trump”, so we may want to go back and remove this word (as we did with the other stop words). It is fairly common to remove proper nouns, but the decision to remove or keep them depends on the context.
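A minimal sketch of removing “trump”, re-using the word counts we created above:

tr_tweets_informative_words %>%
  filter(word != "trump") %>% # drop the proper noun
  head()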

We may want to remove the word “like” as well, although we could instead extract the tweets that contain “like” and inspect how it is used. We might also wonder how the word “great” is being used. If it is used in the context of “Make America Great Again”, is it valid to assign “great” a “positive” sentiment? I’m undecided. The slogan is a hopeful message for supporters, but it also carries a negative implication: that America is not great right now. Food for thought!
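A quick sketch of pulling the raw tweets that contain “great”, so we can judge the word’s sentiment in context:

tr_tweets %>%
  filter(str_detect(str_to_lower(text), "great")) %>% # case-insensitive match
  slice_head(n = 3) %>%
  pull(text)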

How do positive and negative word appearances compare in 2012 vs. 2016?

bing_counts_short <- tidy_tr_tweets %>% 
  mutate(date = as.Date(created_at),
         year = year(date)) %>%
  select(-c(date)) %>%
  filter(year == 2012 | year == 2016) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  # inner_join: keep only obs. that exist in both x and y (tidy_tr_tweets and sentiment dictionary)
  count(word, sentiment, year) %>%
  arrange(year, sentiment, desc(n))


bing_counts_short %>%
  group_by(year, sentiment) %>% 
  slice_max(order_by = n, n=15) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill=sentiment)) +
  geom_col() +
  facet_wrap(vars(sentiment, year), scales="free") +
  labs(x="Word frequency", y=NULL, title = "Frequency of positive and negative words in Trump's 2012 and 2016 tweets") +
  theme(legend.position = "none")

Topic modelling

A topic model is a type of statistical model used to identify “topics” or “themes” that underlie a collection of documents. It assumes that your documents are a mixture of topics, the number of which YOU specify, and it estimates the keywords that identify topics.

We learn two things:

  1. What topics each document is about (documents are a mixture of topics)
  2. What words best represent each topic (topics are a mixture of words)

We will not get into the nitty-gritty details of how exactly LDA (the type of topic model we are using) generates the topics.

Instead, think of topic modelling as a useful exploratory data analysis tool to help you understand the themes that underlie a set of documents.

Trump Tweets Example

Pre-processing text (tweets):

Below, we repeat some steps we completed earlier (so our code is deliberately redundant). Let’s start over, beginning with cleaning our tweets. The reason I’m re-doing some of the cleaning from above is to show you a topic model from start to finish, and to reiterate that we need to reduce the “noise” in our text data by removing uninformative tokens.

word_count_bytweet <- tr_tweets %>%
  mutate(id_str = as.character(id_str)) %>% 
  select(id_str, text) %>%
  unnest_tokens("word", text)  %>% 
  anti_join(stop_words, by="word") %>%
  filter(!word %in% c("t.co", "https", "http", "amp", "bit.ly", "pm", "people", "trump", "tonight")) %>%
  filter(!str_detect(word, "\\d")) %>%
  group_by(id_str) %>%
  count(word) 

Create DTM/DFM

Our counts are currently stored as one term per document per row. The topicmodels package requires the data to be structured as a document-term matrix (one row per document, one column per term), so we convert it with cast_dtm().

tweets_dtm <- word_count_bytweet %>%
  tidytext::cast_dtm(document = id_str, term = word, value = n)

str(tweets_dtm)
## List of 6
##  $ i       : int [1:145805] 1 4786 1 225 449 1040 1064 1119 1129 1655 ...
##  $ j       : int [1:145805] 1 1 2 2 2 2 2 2 2 2 ...
##  $ v       : num [1:145805] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 20483
##  $ ncol    : int 18888
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:20483] "10027087487" "100585254401748992" "100658594206322688" "100926910372126720" ...
##   ..$ Terms: chr [1:18888] "andrea" "beach" "bocelli" "donald" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
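As an aside, a DTM can be converted back into the tidy one-term-per-row format with tidytext::tidy(). A minimal sketch:

tidytext::tidy(tweets_dtm) %>%
  head() # one row per document-term pair, with the frequency stored in "count"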

Run topic model

Now let’s run a topic model that returns 3 topics (k=3). We set the seed to make our example reproducible. Then we will select the top 20 words for each topic to get a sense of what the topics are about.

tweets_lda <- LDA(tweets_dtm, k=3, control=list(seed=123))

top_terms_lda <- tidy(tweets_lda) %>%
  group_by(topic) %>%
  slice_max(beta, n=20) %>%
  ungroup() %>%
  arrange(topic, -beta)

head(top_terms_lda)
## # A tibble: 6 × 3
##   topic term           beta
##   <int> <chr>         <dbl>
## 1     1 time        0.00835
## 2     1 president   0.00565
## 3     1 world       0.00545
## 4     1 barackobama 0.00523
## 5     1 foxnews     0.00468
## 6     1 watch       0.00441

The beta coefficients tell us the probability of a term being generated from a topic.
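Because each topic is a probability distribution over terms, the betas for any one topic sum to 1. A quick sketch to confirm this with the model we just fit:

tidy(tweets_lda) %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta)) # each topic's betas sum to 1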

top_terms_lda %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(vars(topic), scales = "free_x") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Which tweets are most associated with topic #1?

We can look at the gamma values, which tell us how much each document is associated with each topic.

# extract gamma values (document-topic probabilities)
tweet_topic_probs <- tidy(tweets_lda, matrix = "gamma")
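# sanity check (a quick sketch): each document's gamma values sum to 1 across topics
tweet_topic_probs %>%
  group_by(document) %>%
  summarize(total_gamma = sum(gamma)) %>%
  head()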

tweet_topic1_probs <- tweet_topic_probs %>%
  filter(topic == 1) %>%
  arrange(desc(gamma)) %>%
  filter(gamma >= 0.35) # keep tweets meaningfully associated with topic 1
# inspect one of the tweets most associated with topic 1, matched by its id prefix
top_tweet <- tr_tweets %>%
  filter(str_starts(id_str, "68945839"))
top_tweet$text
## [1] "Dope Frank Bruni said I called many people, including Karl Rove, losers-true! I never called my friend @HowardStern a loser- he’s a winner!"
tr_tweets %>%
  filter(id_str %in% tweet_topic1_probs$document) %>%
  slice_head(n=5)
##   rownames             source             id_str
## 1      112 Twitter Web Client        17580961111
## 2      162 Twitter Web Client        11701179384
## 3      171 Twitter Web Client        10625708767
## 4      352          TweetDeck 141984050075275265
## 5      647 Twitter Web Client 113364823579963392
##                                                                                                                                           text
## 1 Due to popular demand, CNN will re-broadcast  the Larry King Live show I hosted in June, in which I interview Larry. Monday July 5, 9 pm CNN
## 2   Then we attended the Scottish fashion show that benefits veterans, Dressed to Kilt 2010, which I co-hosted with Sir Sean and Lady Connery.
## 3       Olympic Gold  Medalist Evan Lysacek just left my office. He is in town and wanted to meet me--he's a fanastic guy and a true champion.
## 4  Fitch has downgraded our credit outlook to ""negative."" Why? @BarackObama's failure to lead with the Super Committee. http://t.co/Uxyhhq6y
## 5     The World Economic Forum now ranks the US the fifth most competitive economy in the world. We have fallen from first under @BarackObama.
##            created_at retweet_count in_reply_to_user_id_str favorite_count
## 1 2010-07-02 15:27:17            22                    <NA>              8
## 2 2010-04-06 14:34:20             8                    <NA>              2
## 3 2010-03-17 15:14:13            13                    <NA>              7
## 4 2011-11-30 20:56:50            64                    <NA>              7
## 5 2011-09-12 21:34:14           310                    <NA>             12
##   is_retweet
## 1      FALSE
## 2      FALSE
## 3      FALSE
## 4      FALSE
## 5      FALSE

Federal Party Platforms Example

Pre-processing text (platforms):

summary(canada_party_platforms)
##        ID          platform              year          text          
##  Min.   :    1   Length:10485       Min.   :2004   Length:10485      
##  1st Qu.: 2712   Class :character   1st Qu.:2006   Class :character  
##  Median : 5350   Mode  :character   Median :2008   Mode  :character  
##  Mean   : 5367                      Mean   :2009                     
##  3rd Qu.: 8057                      3rd Qu.:2015                     
##  Max.   :10707                      Max.   :2015
table(canada_party_platforms$platform) # how many sentences in each unique platform? 
## 
##  62110_200810_GR.csv  62110_201510_GR.csv 62320_200406_NDP.csv 
##                 1867                  511                  491 
## 62320_200601_NDP.csv 62320_200810_NDP.csv 62320_201510_NDP.csv 
##                  643                  664                  886 
## 62420_200406_LIB.csv 62420_200601_LIB.csv 62420_200810_LIB.csv 
##                  598                 1086                  322 
## 62420_201510_LIB.csv 62623_200406_CON.csv 62623_200601_CON.csv 
##                  868                  458                  525 
## 62623_200810_CON.csv 62623_201510_CON.csv 
##                  270                 1296
df_party_platforms <- canada_party_platforms %>% 
   mutate(party = stringr::str_extract(platform, "(?<=_)[^_]+(?=\\.csv$)")) 
# above we use regular expression to extract the party code from the platform variable (stored at end of the string)
# you will never be tested on regular expressions, but if you are interested in learning regex, you can read more about it here: 
# https://stringr.tidyverse.org/articles/regular-expressions.html 
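# for example (a quick sketch), the pattern grabs the characters between the final "_" and ".csv":
str_extract("62110_200810_GR.csv", "(?<=_)[^_]+(?=\\.csv$)") # returns "GR"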
custom_dictionary <- data.frame(word = c("canada's", "canada", "canadians", "canadian", "canadian's", "elected", "government", "gw", "federal", "liberal", "conservative", "ndp", "stephen", "harper", "jack", "layton"))

tidy_platform_df <- df_party_platforms %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by="word") %>% # remove stopwords (e.g. "and", "the" etc.)
  anti_join(custom_dictionary, by="word") %>% # remove our custom stop words
  group_by(platform) %>% 
  count(word, sort=TRUE) %>%
  ungroup()


platforms_dtm <- tidy_platform_df %>%
  cast_dtm(document = platform, term= word, value=n)
# our document identifier is the variable called "platform"
# Our terms are stored in the "word" column
# the column called "n" tells us word frequency
# DTM: each row is the party-year platform text

platforms_dtm
## <<DocumentTermMatrix (documents: 14, terms: 9455)>>
## Non-/sparse entries: 31585/100785
## Sparsity           : 76%
## Maximal term length: 28
## Weighting          : term frequency (tf)

LDA Topic Model

platforms_lda <- LDA(platforms_dtm, k=10, control=list(seed=123))

top_terms_lda <- tidy(platforms_lda) %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>% # top 15 terms per topic
  arrange(topic, desc(beta))

top_terms_lda
## # A tibble: 150 × 3
## # Groups:   topic [10]
##    topic term              beta
##    <int> <chr>            <dbl>
##  1     1 hundred        0.00979
##  2     1 million        0.00799
##  3     1 support        0.00790
##  4     1 ensure         0.00742
##  5     1 communities    0.00712
##  6     1 invest         0.00669
##  7     1 families       0.00586
##  8     1 national       0.00571
##  9     1 veterans       0.00570
## 10     1 infrastructure 0.00555
## # ℹ 140 more rows
top_terms_lda %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(vars(topic), scales = "free_x") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The topics are hard to interpret from this plot alone. We might adjust k (the number of topics) and look at longer lists of words associated with each topic. For example, below we look at the top words associated with topic #2. It seems to be about climate change (e.g., environmental policy and the effects of climate change).

tidy(platforms_lda) %>%
  group_by(topic) %>%
  arrange(topic, desc(beta)) %>%  
  filter(topic == 2)
## # A tibble: 9,455 × 3
## # Groups:   topic [1]
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     2 green    0.0124 
##  2     2 party    0.00954
##  3     2 health   0.00755
##  4     2 tax      0.00536
##  5     2 water    0.00529
##  6     2 energy   0.00498
##  7     2 support  0.00451
##  8     2 ensure   0.00438
##  9     2 provide  0.00412
## 10     2 canada’s 0.00404
## # ℹ 9,445 more rows

Which platforms are most associated with topic #2?

# extract gamma values (document-topic probabilities)
platform_topic_probs <- tidy(platforms_lda, matrix = "gamma")

platform_topic_probs %>%
  filter(topic==2) %>%
  arrange(desc(gamma))
## # A tibble: 14 × 3
##    document             topic      gamma
##    <chr>                <int>      <dbl>
##  1 62110_200810_GR.csv      2 0.985     
##  2 62420_200810_LIB.csv     2 0.00000407
##  3 62623_200810_CON.csv     2 0.00000407
##  4 62623_200406_CON.csv     2 0.00000272
##  5 62110_201510_GR.csv      2 0.00000258
##  6 62623_200601_CON.csv     2 0.00000229
##  7 62420_200406_LIB.csv     2 0.00000207
##  8 62320_200810_NDP.csv     2 0.00000200
##  9 62320_200601_NDP.csv     2 0.00000183
## 10 62320_200406_NDP.csv     2 0.00000180
## 11 62420_201510_LIB.csv     2 0.00000169
## 12 62320_201510_NDP.csv     2 0.00000152
## 13 62623_201510_CON.csv     2 0.00000103
## 14 62420_200601_LIB.csv     2 0.00000101

The Greens’ 2008 platform is most highly associated with this topic, so we would likely conclude that the topic is indeed about climate change. We might continue to change the number of topics, however, to try to get more semantically useful topics.
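For instance, here is a minimal sketch of re-fitting the model with a different number of topics (k = 5 is an arbitrary choice) and inspecting the top terms:

platforms_lda_k5 <- LDA(platforms_dtm, k = 5, control = list(seed = 123))
tidy(platforms_lda_k5) %>%
  group_by(topic) %>%
  slice_max(beta, n = 15) %>% # top 15 terms per topic
  arrange(topic, desc(beta))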

Text Similarity

The general idea behind computing cosine similarity is this: we have two documents (text unitized at some level; they could be sentences, paragraphs, books, etc.), and we want to know, overall, how similar the documents are based on the words that appear in them.

If two documents have similar words in similar proportions, cosine similarity will be close to 1. If the two documents are very different (e.g., they share no words), cosine similarity will be 0 or close to it.
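Concretely, for two word-count vectors a and b, cosine similarity is sum(a*b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))). A minimal sketch with made-up counts over a shared five-word vocabulary:

# counts of the words: ice, cream, tasty, love, hate
a <- c(2, 2, 1, 1, 0) # document A
b <- c(2, 2, 1, 0, 1) # document B
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))) # 9/10 = 0.9: similar, but not identical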

While we might be able to read the four sentences below and determine that sentences 1, 2, and 3 are quite similar while sentence 4 is clearly different, how could we use statistics or math to do this for us across a large number of sentences? (A sketch follows the list.)

  1. I love ice cream!! It is so tasty.
  2. Ice cream is really tasty, I like it!
  3. Ice cream is so tasty, I love it.
  4. Ice cream sucks, I hate the taste.
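A minimal sketch, running our tidytext + widyr pipeline on these four sentences (with made-up document ids 1 through 4):

sentences <- tibble(doc = 1:4,
                    text = c("I love ice cream!! It is so tasty.",
                             "Ice cream is really tasty, I like it!",
                             "Ice cream is so tasty, I love it.",
                             "Ice cream sucks, I hate the taste."))
sentences %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  pairwise_similarity(doc, word, n) %>%
  arrange(desc(similarity)) # sentences 1-3 should pair highly; sentence 4 should not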

Using our tidied text data, which includes counts for each word in each platform, we can use the appearance of words in platforms to calculate the cosine similarity of the documents.

Federal Party Platforms Example

cosine_similarity <- tidy_platform_df %>%
  pairwise_similarity(platform, word, n) %>% # from widyr package
  arrange(desc(similarity))

cosine_similarity 
## # A tibble: 182 × 3
##    item1                item2                similarity
##    <chr>                <chr>                     <dbl>
##  1 62420_200406_LIB.csv 62420_200601_LIB.csv      0.805
##  2 62420_200601_LIB.csv 62420_200406_LIB.csv      0.805
##  3 62420_201510_LIB.csv 62320_201510_NDP.csv      0.786
##  4 62320_201510_NDP.csv 62420_201510_LIB.csv      0.786
##  5 62320_200406_NDP.csv 62320_200601_NDP.csv      0.766
##  6 62320_200601_NDP.csv 62320_200406_NDP.csv      0.766
##  7 62320_200601_NDP.csv 62420_200601_LIB.csv      0.736
##  8 62420_200601_LIB.csv 62320_200601_NDP.csv      0.736
##  9 62320_200810_NDP.csv 62320_200601_NDP.csv      0.734
## 10 62320_200601_NDP.csv 62320_200810_NDP.csv      0.734
## # ℹ 172 more rows

Of the pairs of documents with cosine similarity above 0.70, how many were comparisons of the same party in different election years?

cosine_similarity %>%
  mutate(
    party1 = str_extract(item1, "(?<=_)[A-Z]+(?=\\.csv)"),
    party2 = str_extract(item2, "(?<=_)[A-Z]+(?=\\.csv)")
  ) %>%
  filter(similarity > 0.70) %>%
  # note: each pair appears twice (A-B and B-A), so these counts double-count pairs;
  # the proportion is unaffected
  summarize(
    total_above_70 = n(),
    same_party_n = sum(party1 == party2),
    same_party_prop = same_party_n/total_above_70)
## # A tibble: 1 × 3
##   total_above_70 same_party_n same_party_prop
##            <int>        <int>           <dbl>
## 1             34           16           0.471

Wrapping up

We covered a lot of new packages and functions in this class. The good news is that you do not need to implement “text mining” in any of the written assignments. For that reason, I am not providing a list of functions/packages like I usually do at the end of our labs.

We also did not cover how to validate these methods. If you are interested in implementing them in the future, be sure to read about the different ways you can validate your results (e.g., assessing the semantic coherence of the topics in a topic model).

While I do not expect you to remember all of the new functions and packages we covered in today’s lab, you should have a general grasp of the key text analysis methods we discussed: dictionary analysis (sentiment analysis being the example covered in this lab), topic modelling, and text similarity. When are they used? How do they work (in general terms)?