In this lesson, we will open text data and conduct some basic analyses. This lab will help you cultivate a deeper understanding of text methods we discussed in class, as well as some of the challenges associated with analyzing text data.
Remember to start by setting your working directory to a new folder, and saving your script/markdown document in this folder. You should also store any data you need in that folder. Load and/or install packages at the top of your script.
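If you have not used some of these packages before, you will need to install them once. A minimal sketch (run this once; you do not need to re-run it every time):

# install the packages used in this lab (stringr and ggplot2 come bundled with the tidyverse)
install.packages(c("rio", "tidyverse", "tidytext", "tm", "topicmodels", "widyr"))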
Note: even if you set your working directory through the toolbar in RStudio (e.g. by clicking Session –> Set Working Directory…), you should still paste the corresponding code into your R script or R Markdown document. This makes your code replicable and ensures you know which folder on your computer you’re working out of.
setwd("/Users/shanaya/POL3325G/Lectures/Lecture 11")
library(rio)
library(tidyverse)
library(tidytext)
library(stringr)
library(ggplot2)
library(tm)
library(topicmodels)
library(widyr)
tr_tweets <- rio::import("trump_tweets.csv")
load("canada_party_platforms.RData")
The example below is based on an example created by Chris Bail from Duke University.
# below I show you how to tidy the data such that there is one word per row
tidy_tr_tweets <- tr_tweets %>%
select(created_at, text) %>%
unnest_tokens("word", text)
Above, we tokenize our data such that there is one word/term per row. This is in keeping with the “tidy data” format (each variable is a column, each observation is a row). This is not the only way to store text data, however; oftentimes researchers will use document-term matrices instead. We will see an example of a DTM below.
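To see what this structure looks like, peek at the first few rows of the tokenized tweets (exactly what you see will depend on your copy of the data):

# each row is one token (word) from one tweet, alongside that tweet's timestamp
head(tidy_tr_tweets)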
# how often does each token appear?
# unnest_tokens() "tokenizes" our text - it splits the text into smaller subunits (words)
word_count <- tr_tweets %>%
select(created_at, text) %>%
unnest_tokens(output = "word", input = text) %>%
count(word) %>% # count how many times each word appears
arrange(desc(n))
head(word_count)
## word n
## 1 the 13373
## 2 to 8113
## 3 t.co 6584
## 4 a 6340
## 5 and 5966
## 6 is 5572
There are 362,237 word tokens in this dataset. When we look at the most frequent words, most of them are uninformative (e.g., “the”, “and”, “to”, “of”). This highlights why we often pre-process our text before analyzing it.
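(If you want to check this count yourself, you can count the rows of the tokenized data; your exact number may differ slightly depending on the version of the dataset you downloaded.)

# each row of tidy_tr_tweets is one token, so the number of rows is the total token count
nrow(tidy_tr_tweets)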
data("stop_words") # import a stop words dataframe
# in the code below, we take several "pre-processing" steps
tr_tweets_informative_words <- tidy_tr_tweets %>%
anti_join(stop_words, by="word") %>% # remove stop words using pre-determined stop word dictionary
filter(!str_detect(word, "\\d")) %>% # remove any value in the word column that contains a digit
count(word) %>%
arrange(desc(n)) %>%
filter(!word %in% c("t.co", "https", "http", "amp")) # remove list of custom "stop words"
head(tr_tweets_informative_words)
## word n
## 1 trump 1631
## 2 people 1067
## 3 obama 1048
## 4 time 777
## 5 america 743
## 6 country 574
tr_tweets_informative_words %>%
slice_head(n=20) %>%
mutate(word = reorder(word, n)) %>% # order the bars by frequency rather than alphabetically
ggplot(aes(x=word, y=n, fill=word))+
geom_col()+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
labs(y = "Frequency",
x = "Word",
title="Top 20 Words in Trump's Tweets")
Sentiment analysis is a type of dictionary analysis. We will use the “bing” sentiment dictionary included in the tidytext package. This is a simple dictionary with a long list of words associated with “positive” sentiment and a long list of words associated with “negative” sentiment. Sometimes sentiment dictionaries will include positive/negative sentiment scores for words, e.g. “great” = 2 while “good” = 1.
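To give you a sense of how a scored dictionary could be used, here is a minimal sketch. The dictionary below is made up for illustration (it is not a real lexicon, although the real AFINN lexicon works on this principle), and each tweet’s score is simply the sum of the scores of its words:

# a made-up scored dictionary, for illustration only
scored_dict <- data.frame(word = c("great", "good", "terrible"),
                          score = c(2, 1, -2))

# join the scores onto the tokenized tweets and add them up for each tweet
tidy_tr_tweets %>%
  inner_join(scored_dict, by = "word") %>%
  group_by(created_at) %>%
  summarize(tweet_score = sum(score)) %>%
  arrange(desc(tweet_score))

Now let’s look at the “bing” dictionary itself: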
head(get_sentiments("bing"))
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
bing_word_counts <- tidy_tr_tweets %>%
inner_join(get_sentiments("bing"), by="word") %>%
# inner_join: keep only obs. that exist in both x and y (tidy_tr_tweets and sentiment dictionary)
count(word, sentiment) %>%
arrange(sentiment, desc(n)) # add desc() around sentiment to sort by positive words instead
We are using the Bing et al. sentiment dictionary (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n=15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill=sentiment)) +
geom_col() +
facet_wrap(vars(sentiment), scales="free") +
labs(x="Word frequency", y=NULL)
“Trump” here most likely refers to Donald Trump’s name rather than the verb (or common noun) “trump”, so we may want to go back and remove this word (as we did with the other stop words). It is pretty common to remove proper nouns, but the decision to remove or keep them depends on the context.
We may want to remove the word “like” as well, although we could instead extract the tweets that contain “like” to see how it is used. We might also wonder how the word “great” is being used. If it appears in the context of “Make America Great Again”, is it valid to assign “great” a “positive” sentiment? I’m undecided. The slogan is a hopeful message for supporters, but it also carries a negative implication: that America is “not” great right now. Food for thought!
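If you want to judge the context yourself, here is a quick sketch (using the tr_tweets data loaded above) that pulls a few of the original tweets containing the word “great” so you can read them in full:

# read a handful of tweets containing "great" to see how the word is used in context
tr_tweets %>%
  filter(str_detect(str_to_lower(text), "great")) %>%
  select(created_at, text) %>%
  slice_head(n = 5)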
bing_counts_short <- tidy_tr_tweets %>%
mutate(date = as.Date(created_at),
year = year(date)) %>% # year() comes from lubridate (loaded with recent versions of the tidyverse)
select(-c(date)) %>%
filter(year == 2012 | year == 2016) %>%
inner_join(get_sentiments("bing"), by="word") %>%
# inner_join: keep only obs. that exist in both x and y (tidy_tr_tweets and sentiment dictionary)
count(word, sentiment, year) %>%
arrange(year, sentiment, desc(n))
bing_counts_short %>%
group_by(year, sentiment) %>%
slice_max(order_by = n, n=15) %>%
ungroup() %>%
ggplot(aes(n, word, fill=sentiment)) +
geom_col() +
facet_wrap(vars(sentiment, year), scales="free") +
labs(x="Word frequency", y=NULL, title = "Frequency of positive and negative words in Trump's 2012 and 2016 tweets") +
theme(legend.position = "none")
A topic model is a type of statistical model used to identify “topics” or “themes” that underlie a collection of documents. It assumes that your documents are a mixture of topics, the number of which YOU specify, and it estimates the keywords that characterize each topic.
We learn two things: (1) which words are most strongly associated with each topic (the beta values we will see below), and (2) how strongly each document is associated with each topic (the gamma values).
We will not get into the nitty-gritty details of how exactly LDA (the type of topic model we are using) generates the topics.
Instead, think of topic modelling as a useful exploratory data analysis tool to help you understand the themes that underlie a set of documents.
Below, we repeat some steps we completed earlier (so our code is somewhat redundant). Let’s start over, beginning with cleaning our tweets. The reason I’m re-doing some of the cleaning from above is that I want to show you a topic model example from start to finish, and to reiterate that we need to reduce the “noise” in our text data by removing uninformative tokens.
word_count_bytweet <- tr_tweets %>%
mutate(id_str = as.character(id_str)) %>%
select(id_str, text) %>%
unnest_tokens("word", text) %>%
anti_join(stop_words, by="word") %>%
filter(!word %in% c("t.co", "https", "http", "amp", "bit.ly", "pm", "people", "trump", "tonight")) %>%
filter(!str_detect(word, "\\d")) %>%
group_by(id_str) %>%
count(word)
Let’s put our data into one-term-per-document-per-row format (instead of one-term-per-row). The topicmodels package requires our data to be structured as a document-term matrix.
tweets_dtm <- word_count_bytweet %>%
tidytext::cast_dtm(document = id_str, term = word, value = n)
str(tweets_dtm)
## List of 6
## $ i : int [1:145805] 1 4786 1 225 449 1040 1064 1119 1129 1655 ...
## $ j : int [1:145805] 1 1 2 2 2 2 2 2 2 2 ...
## $ v : num [1:145805] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 20483
## $ ncol : int 18888
## $ dimnames:List of 2
## ..$ Docs : chr [1:20483] "10027087487" "100585254401748992" "100658594206322688" "100926910372126720" ...
## ..$ Terms: chr [1:18888] "andrea" "beach" "bocelli" "donald" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
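The str() output above shows that the DTM is stored as a sparse “triplet” matrix: most tweets do not contain most terms, so only the non-zero entries are recorded. If you want to peek at a small corner of the matrix, one option is the inspect() function from the tm package we loaded earlier (a sketch; the documents and terms shown will depend on your data):

# view the first 5 documents and first 5 terms of the document-term matrix
inspect(tweets_dtm[1:5, 1:5])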
Now let’s run a topic model that returns 3 topics (k=3). We set the seed to make our example reproducible. Then we will select the top 20 terms for each topic to get a sense of what the topics are about.
tweets_lda <- LDA(tweets_dtm, k=3, control=list(seed=123))
top_terms_lda <- tidy(tweets_lda) %>%
group_by(topic) %>%
slice_max(beta, n=20) %>%
ungroup() %>%
arrange(topic, -beta)
head(top_terms_lda)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 time 0.00835
## 2 1 president 0.00565
## 3 1 world 0.00545
## 4 1 barackobama 0.00523
## 5 1 foxnews 0.00468
## 6 1 watch 0.00441
The beta coefficients tell us the probability of a term being generated from a topic.
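You can also look up a single term’s beta under each topic. A small sketch (the term chosen here is arbitrary):

# compare how strongly one term is associated with each of the topics
tidy(tweets_lda) %>%
  filter(term == "president") %>%
  arrange(desc(beta))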
top_terms_lda %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta)) +
geom_col() +
scale_x_reordered() +
facet_wrap(vars(topic), scales = "free_x") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Which tweets are most associated with topic #1?
We can look at the gamma values, which tell us how much each document is associated with each topic.
# extract gamma values (document-topic probabilities)
tweet_topic_probs <- tidy(tweets_lda, matrix = "gamma")
tweet_topic1_probs <- tweet_topic_probs %>%
filter(topic == 1) %>%
arrange(desc(gamma)) %>%
mutate(gamma = as.numeric(gamma)) %>%
filter(gamma >= 0.35)
# pull out a specific tweet by the start of its id_str (presumably one of the tweets identified above)
top_tweet <- tr_tweets %>%
filter(str_starts(id_str, "68945839"))
top_tweet$text
## [1] "Dope Frank Bruni said I called many people, including Karl Rove, losers-true! I never called my friend @HowardStern a loser- he’s a winner!"
tr_tweets %>%
filter(id_str %in% tweet_topic1_probs$document) %>%
slice_head(n=5)
## rownames source id_str
## 1 112 Twitter Web Client 17580961111
## 2 162 Twitter Web Client 11701179384
## 3 171 Twitter Web Client 10625708767
## 4 352 TweetDeck 141984050075275265
## 5 647 Twitter Web Client 113364823579963392
## text
## 1 Due to popular demand, CNN will re-broadcast the Larry King Live show I hosted in June, in which I interview Larry. Monday July 5, 9 pm CNN
## 2 Then we attended the Scottish fashion show that benefits veterans, Dressed to Kilt 2010, which I co-hosted with Sir Sean and Lady Connery.
## 3 Olympic Gold Medalist Evan Lysacek just left my office. He is in town and wanted to meet me--he's a fanastic guy and a true champion.
## 4 Fitch has downgraded our credit outlook to ""negative."" Why? @BarackObama's failure to lead with the Super Committee. http://t.co/Uxyhhq6y
## 5 The World Economic Forum now ranks the US the fifth most competitive economy in the world. We have fallen from first under @BarackObama.
## created_at retweet_count in_reply_to_user_id_str favorite_count
## 1 2010-07-02 15:27:17 22 <NA> 8
## 2 2010-04-06 14:34:20 8 <NA> 2
## 3 2010-03-17 15:14:13 13 <NA> 7
## 4 2011-11-30 20:56:50 64 <NA> 7
## 5 2011-09-12 21:34:14 310 <NA> 12
## is_retweet
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
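Before moving on, here is an optional sketch using the tweet_topic_probs object created above: if you wanted to assign each tweet its single most likely topic, you could keep the row with the highest gamma for each document:

# for each tweet (document), keep only the topic with the highest gamma
tweet_topic_probs %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()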
Now let’s apply these tools to a different corpus: the Canadian party platform data we loaded at the top of the lab.
summary(canada_party_platforms) # overview of the data: one sentence of platform text per row
## ID platform year text
## Min. : 1 Length:10485 Min. :2004 Length:10485
## 1st Qu.: 2712 Class :character 1st Qu.:2006 Class :character
## Median : 5350 Mode :character Median :2008 Mode :character
## Mean : 5367 Mean :2009
## 3rd Qu.: 8057 3rd Qu.:2015
## Max. :10707 Max. :2015
table(canada_party_platforms$platform) # how many sentences in each unique platform?
##
## 62110_200810_GR.csv 62110_201510_GR.csv 62320_200406_NDP.csv
## 1867 511 491
## 62320_200601_NDP.csv 62320_200810_NDP.csv 62320_201510_NDP.csv
## 643 664 886
## 62420_200406_LIB.csv 62420_200601_LIB.csv 62420_200810_LIB.csv
## 598 1086 322
## 62420_201510_LIB.csv 62623_200406_CON.csv 62623_200601_CON.csv
## 868 458 525
## 62623_200810_CON.csv 62623_201510_CON.csv
## 270 1296
df_party_platforms <- canada_party_platforms %>%
mutate(party = stringr::str_extract(platform, "(?<=_)[^_]+(?=\\.csv$)"))
# above we use a regular expression to extract the party code from the platform variable (stored at the end of the string)
# you will never be tested on regular expressions, but if you are interested in learning about regex, you can read more about it here:
# https://stringr.tidyverse.org/articles/regular-expressions.html
custom_dictionary <- data.frame(word = c("canada's", "canada", "canadians", "canadian", "canadian's", "elected", "government", "gw", "federal", "liberal", "conservative", "ndp", "stephen", "harper", "jack", "layton")) # additional custom stop words to remove
tidy_platform_df <- df_party_platforms %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by="word") %>% # remove stopwords (e.g. "and", "the" etc.)
anti_join(custom_dictionary, by="word") %>% # custom stop words we want to remove
group_by(platform) %>%
count(word, sort=TRUE) %>%
ungroup()
platforms_dtm <- tidy_platform_df %>%
cast_dtm(document = platform, term= word, value=n)
# our document identifier is the variable called "platform"
# Our terms are stored in the "word" column
# the column called "n" tells us word frequency
# DTM: each row is the party-year platform text
platforms_dtm
## <<DocumentTermMatrix (documents: 14, terms: 9455)>>
## Non-/sparse entries: 31585/100785
## Sparsity : 76%
## Maximal term length: 28
## Weighting : term frequency (tf)
platforms_lda <- LDA(platforms_dtm, k=10, control=list(seed=123))
top_terms_lda <- tidy(platforms_lda) %>%
group_by(topic) %>%
arrange(topic, desc(beta)) %>% # Sort by topic and beta in descending order
slice_head(n = 15) %>%
arrange(topic, -beta)
top_terms_lda
## # A tibble: 150 × 3
## # Groups: topic [10]
## topic term beta
## <int> <chr> <dbl>
## 1 1 hundred 0.00979
## 2 1 million 0.00799
## 3 1 support 0.00790
## 4 1 ensure 0.00742
## 5 1 communities 0.00712
## 6 1 invest 0.00669
## 7 1 families 0.00586
## 8 1 national 0.00571
## 9 1 veterans 0.00570
## 10 1 infrastructure 0.00555
## # ℹ 140 more rows
top_terms_lda %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta)) +
geom_col() +
scale_x_reordered() +
facet_wrap(vars(topic), scales = "free_x") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The topics are hard to interpret from this plot alone. We might adjust k (the number of topics) and look at longer lists of words associated with each topic. For example, below we look at the top words associated with topic #2. It seems to be about climate change and the environment - note words like “green”, “water”, and “energy”.
tidy(platforms_lda) %>%
group_by(topic) %>%
arrange(topic, desc(beta)) %>%
filter(topic == 2)
## # A tibble: 9,455 × 3
## # Groups: topic [1]
## topic term beta
## <int> <chr> <dbl>
## 1 2 green 0.0124
## 2 2 party 0.00954
## 3 2 health 0.00755
## 4 2 tax 0.00536
## 5 2 water 0.00529
## 6 2 energy 0.00498
## 7 2 support 0.00451
## 8 2 ensure 0.00438
## 9 2 provide 0.00412
## 10 2 canada’s 0.00404
## # ℹ 9,445 more rows
Which platforms are most associated with topic #2?
# extract gamma values (document-topic probabilities)
platform_topic_probs <- tidy(platforms_lda, matrix = "gamma")
platform_topic_probs %>%
filter(topic==2) %>%
arrange(desc(gamma))
## # A tibble: 14 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 62110_200810_GR.csv 2 0.985
## 2 62420_200810_LIB.csv 2 0.00000407
## 3 62623_200810_CON.csv 2 0.00000407
## 4 62623_200406_CON.csv 2 0.00000272
## 5 62110_201510_GR.csv 2 0.00000258
## 6 62623_200601_CON.csv 2 0.00000229
## 7 62420_200406_LIB.csv 2 0.00000207
## 8 62320_200810_NDP.csv 2 0.00000200
## 9 62320_200601_NDP.csv 2 0.00000183
## 10 62320_200406_NDP.csv 2 0.00000180
## 11 62420_201510_LIB.csv 2 0.00000169
## 12 62320_201510_NDP.csv 2 0.00000152
## 13 62623_201510_CON.csv 2 0.00000103
## 14 62420_200601_LIB.csv 2 0.00000101
The Green Party’s 2008 platform is by far the most strongly associated with this topic, so we would likely conclude that the topic is indeed about climate change. We might continue to change the number of topics, however, to try to get more semantically useful topics (a sketch of this follows below).
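For example, here is a sketch of refitting the model with a different (arbitrarily chosen) number of topics and comparing the top terms across the two solutions:

# refit the topic model with k = 5 instead of k = 10
platforms_lda_k5 <- LDA(platforms_dtm, k = 5, control = list(seed = 123))

# top 10 terms per topic for the k = 5 solution
tidy(platforms_lda_k5) %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)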
The general idea behind computing cosine similarity is this: we have two documents (text unitized at some level - sentences, paragraphs, books, etc.), and we want to know how similar the two documents are, overall, based on the words that appear in them.
If two documents contain similar words in similar proportions, their cosine similarity will be close to 1. If the two documents are very different, their cosine similarity will be close to 0 (if they share no words at all, it will be exactly 0).
While we might be able to read four sentences ourselves and determine that sentences 1, 2, and 3 are quite similar while sentence 4 is clearly different, how could we use statistics or math to do this for us across a large number of sentences?
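To make the calculation concrete, here is a toy sketch with two made-up word-count vectors (the words and counts are invented for illustration). Cosine similarity is the dot product of the two vectors divided by the product of their lengths:

# two made-up "documents" represented as word counts over the same vocabulary
doc_a <- c(economy = 3, jobs = 2, climate = 0, health = 1)
doc_b <- c(economy = 2, jobs = 1, climate = 0, health = 2)

# cosine similarity = dot product / (length of doc_a * length of doc_b)
sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))

Here the two vectors point in a similar direction, so the similarity is close to 1.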
Using our tidied text data, which includes counts for each word in each platform, we can calculate the cosine similarity between the platforms based on the words they contain.
cosine_similarity <- tidy_platform_df %>%
pairwise_similarity(platform, word, n) %>% # from widyr package
arrange(desc(similarity))
cosine_similarity
## # A tibble: 182 × 3
## item1 item2 similarity
## <chr> <chr> <dbl>
## 1 62420_200406_LIB.csv 62420_200601_LIB.csv 0.805
## 2 62420_200601_LIB.csv 62420_200406_LIB.csv 0.805
## 3 62420_201510_LIB.csv 62320_201510_NDP.csv 0.786
## 4 62320_201510_NDP.csv 62420_201510_LIB.csv 0.786
## 5 62320_200406_NDP.csv 62320_200601_NDP.csv 0.766
## 6 62320_200601_NDP.csv 62320_200406_NDP.csv 0.766
## 7 62320_200601_NDP.csv 62420_200601_LIB.csv 0.736
## 8 62420_200601_LIB.csv 62320_200601_NDP.csv 0.736
## 9 62320_200810_NDP.csv 62320_200601_NDP.csv 0.734
## 10 62320_200601_NDP.csv 62320_200810_NDP.csv 0.734
## # ℹ 172 more rows
Of the pairs of documents with a cosine similarity above 0.70, how many were comparisons of the same party’s platforms in different election years? (Note that pairwise_similarity() lists each pair twice, once in each order, so the counts below include both orderings; the proportion is unaffected.)
cosine_similarity %>%
mutate(
party1 = str_extract(item1, "(?<=_)[A-Z]+(?=\\.csv)"),
party2 = str_extract(item2, "(?<=_)[A-Z]+(?=\\.csv)")
) %>%
filter(similarity > 0.70) %>%
summarize(
total_above_70 = n(),
same_party_n = sum(party1 == party2),
same_party_prop = same_party_n/total_above_70)
## # A tibble: 1 × 3
## total_above_70 same_party_n same_party_prop
## <int> <int> <dbl>
## 1 34 16 0.471
We covered a lot of new packages and functions in this class. The good news is that you do not need to implement “text mining” in any of the written assignments. For that reason, I am not providing a list of functions/packages like I usually do at the end of our labs.
We also did not cover how to validate these methods - if you are interested in using them in the future, be sure to read about the different ways you can validate your results (e.g. assessing the semantic coherence of the topics in a topic model).
While I do not expect you to remember all of the new functions and packages we covered in today’s lab, you should have a general grasp of the key text analysis methods we discussed: dictionary analysis (with sentiment analysis as the example covered in this lab), topic modelling, and text similarity. When are they used? How do they work (in general terms)?