Today we are going to conclude our lab sessions (except for a full day of open lab) by talking about two more advanced topics for analyzing Twitter data (and text data more generally). Sentiment analysis and topic models.
Sentiment analysis (also sometimes called opinion mining or emotion AI) is a natural language processing technique to systematically quantify affective states represented by text.
To get started, let’s grab some Tweets using a function from Lab 3. In particular, we are going to compare the most recent 1000 Tweets from AOC’s personal account, AOC’s official account, and MTG’s official account. You won’t be able to run this code on your own without Developer access, but here is how the data was collected:
wd <- "D:/Twitter"
setwd(wd)
twitter_info <- read.csv("twitter_info2.csv",
stringsAsFactors = F)
last_n_tweets(bearer_token = twitter_info$bearer_token,
user_id = "AOC",
n = 1000,
tweet_fields = c("created_at",
"public_metrics",
"source")) -> AOC
last_n_tweets(bearer_token = twitter_info$bearer_token,
user_id = "RepAOC",
n = 1000,
tweet_fields = c("created_at",
"public_metrics",
"source")) -> RepAOC
last_n_tweets(bearer_token = twitter_info$bearer_token,
user_id = "RepMTG",
n = 1000,
tweet_fields = c("created_at",
"public_metrics",
"source")) -> RepMTG
saveRDS(AOC,"AOC.RDS")
saveRDS(RepAOC, "RepAOC.RDS")
saveRDS(RepMTG, "RepMTG.RDS")
Since I have already collected these for you, we can just read them in and load the required packages for today:
library(dplyr)
library(plyr)
library(stringdist)
AOC <- readRDS(url("https://www.dropbox.com/s/ueamyw0vg85lglp/AOC.RDS?dl=1"))
RepAOC <- readRDS(url("https://www.dropbox.com/s/fq1ug2v2ern9yzz/RepAOC.RDS?dl=1"))
RepMTG <- readRDS(url("https://www.dropbox.com/s/atkz3ib6dm9vhwy/RepMTG.RDS?dl=1"))
Let’s start with AOC’s personal account and take a quick look at what we grabbed:
head(AOC$data.text,10)
## [1] "RT @thedailybeast: EXCLUSIVE: Twitter is failing to remove 99 percent of hate speech posted by Twitter Blue users, new research has found,…"
## [2] "RT @Acyn: McCarthy: We might have a child that has no job, no dependents but sitting on the couch, we’re going to encourage that person to…"
## [3] "FYI there’s a fake account on here impersonating me and going viral. The Twitter CEO has engaged it, boosting visibility.\n\nIt is releasing false policy statements and gaining spread.\n\nI am assessing with my team how to move forward. In the meantime, be careful of what you see."
## [4] "RT @oneunderscore__: This quote that Elon Musk retweeted is not actually from Voltaire, it's from neo-Nazi and white supremacist Kevin Stro…"
## [5] "RT @JudiciaryDems: Let's be clear:\n \n-Harlan Crow does not have the authority to claim separation of powers to withhold information\n-We are…"
## [6] "Thank you to everyone who came to our Memorial Day weekend town hall! We had such a great crowd. 🤗\n\nIt was awesome answering all your questions, discussing the debt limit negotiations, and more.\n\nSee you next month! https://t.co/XdAe13wzSQ"
## [7] "RT @IndivisibleTeam: If you missed our phonebank with @AOC: \"Contrary to what Kevin McCarthy says, the debt limit is not about raising our…"
## [8] "RT @JoeBiden: Defaulting on the debt—including trillions incurred under Donald Trump—could mean seniors missing Social Security checks and…"
## [9] "RT @NoLieWithBTC: If you have student loan debt, you should know this:\n\nEvery single House Republican just voted to overturn President Bide…"
## [10] "RT @jackiekcalmes: Little-noted this wk: While McCarthy was all budget-cutting bravado in debt-limit talks with Biden, the House Appropriat…"
It is fairly clear that some of these Tweets have a “positive” connotation whereas others have a “negative” connotation. How might we systematically identify, across a large corpus, what sort of sentiment each of these texts conveys?
The basic idea underlying sentiment analysis is the following. For each of the texts in our corpus we want to…
A number of approaches to sentiment analysis are contained in the appropriately named SentimentAnalysis package and is quite straightforward to run after a bit of cleaning to get rid of those pesky emojis.
library(SentimentAnalysis)
AOC$data.text <- gsub('[^\x01-\x7F]', '', AOC$data.text)
sentiments <- analyzeSentiment(AOC$data.text)
summary(sentiments)
## WordCount SentimentGI NegativityGI PositivityGI
## Min. : 0.00 Min. :-0.52000 Min. :0.00000 Min. :0.00000
## 1st Qu.:10.00 1st Qu.:-0.03846 1st Qu.:0.00000 1st Qu.:0.07143
## Median :13.00 Median : 0.00000 Median :0.09091 Median :0.14907
## Mean :14.16 Mean : 0.04405 Mean :0.11146 Mean :0.15552
## 3rd Qu.:17.00 3rd Qu.: 0.14286 3rd Qu.:0.16667 3rd Qu.:0.23077
## Max. :35.00 Max. : 1.00000 Max. :0.60000 Max. :1.00000
## NA's :1 NA's :1 NA's :1
## SentimentHE NegativityHE PositivityHE SentimentLM
## Min. :-0.250000 Min. :0.000000 Min. :0.0000 Min. :-0.50000
## 1st Qu.: 0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:-0.08333
## Median : 0.000000 Median :0.000000 Median :0.0000 Median : 0.00000
## Mean : 0.005803 Mean :0.004601 Mean :0.0104 Mean :-0.04076
## 3rd Qu.: 0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.: 0.00000
## Max. : 0.200000 Max. :0.250000 Max. :0.2000 Max. : 0.33333
## NA's :1 NA's :1 NA's :1 NA's :1
## NegativityLM PositivityLM RatioUncertaintyLM SentimentQDAP
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :-0.60000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.: 0.00000
## Median :0.03571 Median :0.00000 Median :0.000000 Median : 0.00000
## Mean :0.05781 Mean :0.01705 Mean :0.008136 Mean : 0.04504
## 3rd Qu.:0.08696 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.: 0.13793
## Max. :0.50000 Max. :0.33333 Max. :0.333333 Max. : 0.66667
## NA's :1 NA's :1 NA's :1 NA's :1
## NegativityQDAP PositivityQDAP
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.06897 Median :0.1111
## Mean :0.07779 Mean :0.1228
## 3rd Qu.:0.12500 3rd Qu.:0.1818
## Max. :0.60000 Max. :0.6667
## NA's :1 NA's :1
Dictionary methods are pretty quick! Here we can see that, by default, the SentimentAnaysis package gives us the results from 13 different approaches. Here is a quick breakdown:
Here are some glimpses as the dictionaries being used:
GI <- loadDictionaryGI()
data.frame(pos = GI$positiveWords[1:10],
neg = GI$negativeWords[1:10])
## pos neg
## 1 abid abandon
## 2 abil abat
## 3 abl abdic
## 4 abound abhor
## 5 absolv abject
## 6 absorb abnorm
## 7 absorpt abolish
## 8 abund abomin
## 9 acced abras
## 10 accentu abrupt
HE <- loadDictionaryHE()
data.frame(pos = HE$positiveWords[1:10],
neg = HE$negativeWords[1:10])
## pos neg
## 1 abov below
## 2 accomplish challeng
## 3 achiev declin
## 4 beat decreas
## 5 best depress
## 6 better deterior
## 7 certain difficult
## 8 certainti difficulti
## 9 definit disappoint
## 10 deliv down
LM <- loadDictionaryLM()
data.frame(pos = LM$positiveWords[1:10],
neg = LM$negativeWords[1:10])
## pos neg
## 1 abl abandon
## 2 abund abdic
## 3 acclaim aberr
## 4 accomplish abet
## 5 achiev abnorm
## 6 adequ abolish
## 7 advanc abrog
## 8 advantag abrupt
## 9 allianc absenc
## 10 assur absente
QDAP <- loadDictionaryQDAP()
data.frame(pos = QDAP$positiveWords[1:10],
neg = QDAP$negativeWords[1:10])
## pos neg
## 1 a plus abnorm
## 2 abound abolish
## 3 abund abomin
## 4 access abort
## 5 acclaim abrad
## 6 acclam abras
## 7 accolad abrupt
## 8 accommod abscond
## 9 accomod absenc
## 10 accomplish absent mind
Let’s take a look at how the GI and QDAP metric look in comparison to each other:
library(ggplot2)
ggplot(sentiments,aes(x=SentimentGI,y=SentimentQDAP)) +
geom_point()
They are strongly linearly related to each other, but the dictionary used matters for the score! Worth note is that there is a positive relationship between positivity and negativity scores as well:
ggplot(sentiments,aes(x=NegativityGI,y=PositivityGI)) +
geom_point()
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Let’s take a quick look at a few of the most positive Tweets!
AOC$QDAPpos <- sentiments$PositivityQDAP
AOC %>%
arrange(desc(QDAPpos)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "@msolurin Thank you for your work! "
## 2 "Wow. A truly incredible breakthrough moment.\n\nThank you @NASA and to ever…
## 3 "RT @jimmy_wales: What Wikipedia did: we stood strong for our principles and…
## 4 "Wow. https://t.co/1iDO2rX8U0"
## 5 "That was fast https://t.co/ROLUIFYsIV"
## 6 "@Fuddmuggler Understandable"
## 7 "@mattduss Thanks!"
## 8 "@NabilahIslam Congratulations! "
## 9 "RT @KathyHochul: Honored to have Deco's support "
## 10 "Thank you to https://t.co/5LURrPOG6U"
## # … with 985 more rows
AOC$QDAPneg <- sentiments$NegativityQDAP
AOC %>%
arrange(desc(QDAPneg)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "I have yet to hear a real explanation from any official hesitating to conde…
## 2 "Lets see if he throttles this as negativity too "
## 3 "@PabloReports No worries "
## 4 "Forced pregnancy is a crime against humanity."
## 5 "RT @Acyn: Jones: To expel voices of opposition and dissent is a signal of a…
## 6 "RT @justicedems: \"Jordan Neely was killed by public policy. He was killed …
## 7 "If a petty HOA complaint were a person https://t.co/bJVorhQh8Y"
## 8 "RT @BMWEDIBT: The only chance we had at obtaining sick leave was to pass bo…
## 9 "RT @IndivisibleTeam: If you missed our phonebank with @AOC: \"Contrary to w…
## 10 "RT @radleybalko: Fearmongering works:\n\nOK violent crime rate: 458 per 100…
## # … with 985 more rows
And the top performers using the GI dictionary:
AOC$GIpos <- sentiments$PositivityGI
AOC %>%
arrange(desc(GIpos)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "BUY*"
## 2 "@harikondabolu I completely and totally understand. This is hilarious"
## 3 "No matter where you are in the United States, we will build community with …
## 4 "RT @RepEscobar: Today, Congresswoman Escobar was arrested in front of the S…
## 5 "RT @EricLevitz: The sole reason why the GOP is able to demand one-way conce…
## 6 "RT @JaneMayerNYer: Lawmakers have just added a provision to the National De…
## 7 "@Fuddmuggler Understandable"
## 8 "@mattduss Thanks!"
## 9 "@NabilahIslam Congratulations! "
## 10 "RT @KathyHochul: Honored to have Deco's support "
## # … with 985 more rows
And negative:
AOC$GIneg <- sentiments$NegativityGI
AOC %>%
arrange(desc(GIneg)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "I have yet to hear a real explanation from any official hesitating to conde…
## 2 "Lets see if he throttles this as negativity too "
## 3 "@PabloReports No worries "
## 4 "Forced pregnancy is a crime against humanity."
## 5 "RT @equalityAlec: According to FBI crime estimates released today, \"violen…
## 6 "Republicans keep blaming mass shootings on mental health, but then defend t…
## 7 "RT @justicedems: \"Jordan Neely was killed by public policy. He was killed …
## 8 "RT @adamconover: This is why were striking. The studios are trying to turn …
## 9 "RT @housing4allNY: Fighting to pass #GoodCause eviction is essential. We h…
## 10 "RT @NYTWA: NYC Uber drivers are ON STRIKE!!! Our strike goes until 11:59 pm…
## # … with 985 more rows
While some Tweets appear counter-intuitively placed, the division into positive and negative seems about right! More generally, the sentiment score should give a more balanced impression. Here are the Tweets with the highest GI sentiment:
AOC$GIsentiment <- sentiments$SentimentGI
AOC %>%
arrange(desc(GIsentiment)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "BUY*"
## 2 "@harikondabolu I completely and totally understand. This is hilarious"
## 3 "RT @EricLevitz: The sole reason why the GOP is able to demand one-way conce…
## 4 "@Fuddmuggler Understandable"
## 5 "@mattduss Thanks!"
## 6 "@NabilahIslam Congratulations! "
## 7 "RT @KathyHochul: Honored to have Deco's support "
## 8 "Thank you to https://t.co/5LURrPOG6U"
## 9 "@originalspin Oh thank you! I never knew this and will correct it moving fo…
## 10 "RT @Eve6: Its just funny that as the quality of everything from airlines to…
## # … with 985 more rows
And the lowest:
AOC %>%
arrange(GIsentiment) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 995 × 1
## data.text
## <chr>
## 1 "I have yet to hear a real explanation from any official hesitating to conde…
## 2 "Lets see if he throttles this as negativity too "
## 3 "@PabloReports No worries "
## 4 "RT @justicedems: \"Jordan Neely was killed by public policy. He was killed …
## 5 "RT @adamconover: This is why were striking. The studios are trying to turn …
## 6 "RT @NYTWA: NYC Uber drivers are ON STRIKE!!! Our strike goes until 11:59 pm…
## 7 "If a petty HOA complaint were a person https://t.co/bJVorhQh8Y"
## 8 "RT @akela_lacy: Almost 1/4 of this money was wasted trying to defeat Summer…
## 9 "RT @equalityAlec: According to FBI crime estimates released today, \"violen…
## 10 "RT @IndivisibleTeam: If you missed our phonebank with @AOC: \"Contrary to w…
## # … with 985 more rows
A more sophisticated analysis might include custom dictionaries or custom rules for aggregating the scores to fit the use case and corpus you are working with. In general, however, even the default methods do a good job at sorting out the clearly positive from the clearly negative.
To get a better understanding of how sentiment analysis works, and give you the ability to easily create your own custom dictionaries, let’s take a look at the tidy approach to sentiment analysis. Let’s load in the required packages:
library(tm)
library(dplyr)
library(tidytext)
library(textdata)
library(tidyr)
Step 1 of the process is to convert our text into a corpus and then a document-term matrix. For convenience, we will use the canned cleaning abilities of the latter be used. But, as noted in a previous lab, buyer beware when using pre-canned cleaning procedures for text analysis!
AOC <- AOC[which(nchar(AOC$data.text) > 15),]
AOC$data.text <- gsub("amp","",AOC$data.text)
corp <- VCorpus(VectorSource(AOC$data.text))
dt <- DocumentTermMatrix(corp, control = list(removePunctuation = T,
tolower = T,
removeNumbers = T,
stopwords = T,
stemming = T))
tidy_dt <- tidy(dt)
tidy_dt
## # A tibble: 13,417 × 3
## document term count
## <chr> <chr> <dbl>
## 1 1 blue 1
## 2 1 exclus 1
## 3 1 fail 1
## 4 1 found 1
## 5 1 hate 1
## 6 1 new 1
## 7 1 percent 1
## 8 1 post 1
## 9 1 remov 1
## 10 1 research 1
## # … with 13,407 more rows
Let’s load in the bing dictionary for use:
bing_dict <- get_sentiments("bing")
bing_dict
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
We can then merge these two datasets together:
tidy_dt %>%
inner_join(bing_dict,by=c(term="word")) -> tidy_sentiment
as_tibble(tidy_sentiment)
## # A tibble: 1,334 × 4
## document term count sentiment
## <chr> <chr> <dbl> <chr>
## 1 1 fail 1 negative
## 2 1 hate 1 negative
## 3 3 boost 1 positive
## 4 3 fake 1 negative
## 5 3 gain 1 positive
## 6 5 clear 1 positive
## 7 6 debt 1 negative
## 8 6 great 1 positive
## 9 6 limit 1 negative
## 10 6 thank 1 positive
## # … with 1,324 more rows
To aggregate these sentiments into a document-score, we can do the following:
tidy_sentiment %>%
dplyr::count(document, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment) -> document_scores
as_tibble(document_scores)
## # A tibble: 643 × 4
## document negative positive sentiment
## <chr> <dbl> <dbl> <dbl>
## 1 128 11 0 -11
## 2 221 6 0 -6
## 3 51 5 0 -5
## 4 687 5 0 -5
## 5 794 6 1 -5
## 6 895 5 0 -5
## 7 115 4 0 -4
## 8 135 5 1 -4
## 9 222 5 1 -4
## 10 251 5 1 -4
## # … with 633 more rows
Let’s take a look at the most negative tweets from this method:
AOC$document <- rownames(AOC)
AOC <- merge(AOC,document_scores,by="document",all.x=T)
AOC %>%
arrange(sentiment) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 985 × 1
## data.text
## <chr>
## 1 "I represent Rikers. I cannot tell you how many times Ive heard from both CO…
## 2 "RT @RyanElward: UPS RANK AND FILE \n\nThe @Teamsters are introducing a nat…
## 3 "@JCColtin @CityAndStateNY Is this the new who youd want to have a beer with…
## 4 "For full context, one of Greenes companions in that video was part of the v…
## 5 "I remember how folks stepped up to help Texans when you left them cold and …
## 6 "80%+ of rapes go unreported to police. Should those be treated as false too…
## 7 "Our country criminalizes poverty and homelessness while making it impossibl…
## 8 "RT @chrislhayes: What on *earth* is this statement?"
## 9 "RT @JohnTeufelNYC: NYPD officers who work for five years will now make appr…
## 10 "RT @ManhattanDA: https://t.co/wZ4X6SVDLy"
## # … with 975 more rows
And the most positive:
AOC %>%
arrange(desc(sentiment)) %>%
dplyr::select(data.text) %>%
as_tibble()
## # A tibble: 985 × 1
## data.text
## <chr>
## 1 "What is it with people randomly blaming the mere existence of others for th…
## 2 "RT @MaxKennerly: amazon wanted $3.5 billion in benefits from NY to build th…
## 3 "Lets goooooo https://t.co/2f1CMrBWCc"
## 4 "The coalition united in this push represents a significant breakthrough.\n\…
## 5 "RT @JakeSherman: https://t.co/26w22gLDb0"
## 6 "You sound insecure. As you should be.\n\nYour attempt to seize bodily auton…
## 7 "Jokes aside, this is setting the stage for major potential harm when a natu…
## 8 "RT @nycDSA: Last week we rallied with @AOC @Gonzalez4NY @JuliaCarmel__ @Zoh…
## 9 "@ebottcher Im so sorry youre going through this. If you or your staff need …
## 10 "RT @RollingStone: Bad Bunnys highly anticipated music video for El Apagn in…
## # … with 975 more rows
Not bad, but some obvious room for improvement! One of issues with simple sentiment analysis is that tweets like the following have a lot of positive words which are negated.
AOC %>%
arrange(desc(sentiment)) %>%
dplyr::select(data.text) %>%
.[1,] %>%
as.character() %>%
strsplit(.," ") %>%
unlist() -> words
words
## [1] "What" "is"
## [3] "it" "with"
## [5] "people" "randomly"
## [7] "blaming" "the"
## [9] "mere" "existence"
## [11] "of" "others"
## [13] "for" "their"
## [15] "own" "descent"
## [17] "into" "embracing"
## [19] "neo-nazism?" "Like"
## [21] "girl" "you"
## [23] "did" "that"
## [25] "all" "on"
## [27] "your" "own.\n\nUnless"
## [29] "her" "suggestion"
## [31] "here" "is"
## [33] "she" "started"
## [35] "endorsing" "great"
## [37] "replacement" "theory"
## [39] "because" "she"
## [41] "couldnt" "treat"
## [43] "me" "like"
## [45] "the" "help"
## [47] "https://t.co/5dC2JsKoNY"
bing_dict[which(bing_dict$word %in% words),]
## # A tibble: 4 × 2
## word sentiment
## <chr> <chr>
## 1 endorsing positive
## 2 great positive
## 3 like positive
## 4 randomly negative
So that’s why it was scored positively! Observing this might make you create your own dictionary where, for example, terms like “neo-nazism” etc are treated negatively. More generally, simple dictionary methods such as the above cannot pick up on complex negative connotations such as those held in this tweet.
An alternative approach might be to use a statistical model trained on labeled data and used to predict onto the rest of the dataset. To illustrate this, I hand coded 100 randomly selected Tweets as being positive (1), negative (-1), or neutral/unclear (0).
set.seed(1234)
inds <- sample(1:nrow(AOC),100)
training <- AOC[inds,c("document","data.text")]
predict <- AOC[-inds,]
training$data.text[1:6]
## [1] "RT @RepAOC: \"I had a member of the Republican caucus threaten my life and the Republican caucus rewarded him with one of the most prestigio"
## [2] "Does anyone else miss BBM or is that just me https://t.co/XqU3GMCRmN"
## [3] "Now were talking! Time for people to see a real, forceful push for it. Use the bully pulpit. We need more. https://t.co/dZ1qhdu8iM"
## [4] "@bilalfarooqui If that angers you, maybe you can try comparing it to an NYC teachers salary. They make $50k less under this deal, have higher ed requirements many must go into debt to finance, and arent eligible for overtime."
## [5] "Lmao at a billionaire earnestly trying to sell people on the idea that free speech is actually a $8/mo subscription plan"
## [6] "We are witnessing a judicial coup in process.\n\nIf the President and Congress do not restrain the Court now, the Court is signaling they will come for the Presidential election next.\n\nAll our leaders - regardless of party - must recognize this Constitutional crisis for what it is. https://t.co/DzoIh4n08D"
# training$document <- as.character(training$document)
# training$data.text <- as.character(training$data.text)
# write.csv(training,"training.csv")
The first of these tweets is negative, the second and third positive, the fourth negative, the fifth and sixth negative, and so on. Here is how I coded the entire set:
coded <- read.csv("https://www.dropbox.com/s/01er5ypnmtc6o43/training_coded.csv?dl=1")
table(coded$score)
##
## -1 0 1
## 48 25 27
Let’s look at some positive tweets:
head(coded$data.text[which(coded$score == 1)])
## [1] "Does anyone else miss BBM or is that just me https://t.co/XqU3GMCRmN"
## [2] "Now were talking! Time for people to see a real, forceful push for it. Use the bully pulpit. We need more. https://t.co/dZ1qhdu8iM"
## [3] "RT @librarycongress: Hearing @lizzo play some of the Library's priceless antique instruments on Monday was such a gift, and we were honored"
## [4] "Shout out to NY Assemblymember @yuhline who has more info here: https://t.co/ZQbA93bUZE"
## [5] "@Welcome2theBX @KathyHochul Thank you! We are securing similar pedestrian investments for Westchester Square on the Bronx side as well!"
## [6] "RT @billscher: Progressives with two heartland wins tonight: Wisconsin and Chicago."
Some negative tweets:
head(coded$data.text[which(coded$score == -1)])
## [1] "RT @RepAOC: \"I had a member of the Republican caucus threaten my life and the Republican caucus rewarded him with one of the most prestigio"
## [2] "@bilalfarooqui If that angers you, maybe you can try comparing it to an NYC teachers salary. They make $50k less under this deal, have higher ed requirements many must go into debt to finance, and arent eligible for overtime."
## [3] "Lmao at a billionaire earnestly trying to sell people on the idea that free speech is actually a $8/mo subscription plan"
## [4] "We are witnessing a judicial coup in process.\n\nIf the President and Congress do not restrain the Court now, the Court is signaling they will come for the Presidential election next.\n\nAll our leaders - regardless of party - must recognize this Constitutional crisis for what it is. https://t.co/DzoIh4n08D"
## [5] "Reminder: This is who the Republican Party elects + elevates to positions of power. This is how they act in the halls of Congress, and this the exle they set for acolytes to follow.\n\nThese people want media to both sides fascism. Dont fall for it. \nhttps://t.co/9Y0GV2Revw"
## [6] "RT @ryanlcooper: they're never going to forgive this women for being braver than them https://t.co/JsqJDhpruO"
And some tweets I couldn’t classify:
head(coded$data.text[which(coded$score == 0)])
## [1] "RT @ABCNewsLive: .@AOC joins @LinseyDavis following the Republican response to Pres. Bidens #SOTU: \"It's unsurprising that Gov. Huckabee S"
## [2] "When you encounter someone with a misperception, share your story. Your personal experiences and facts can powerfully dismantle common misperceptions and propaganda in the people around you. Its one of the most powerful tools we have - far more powerful than any TV pundit."
## [3] "Its almost midnight. Welcome to political asmr Twitter \n\n NY Early Voting starts next Saturday Oct 29th \n\n California Early Ballots have hit mailboxes, send yours back this week https://t.co/GyTRJaigt2"
## [4] "@Fuddmuggler Understandable"
## [5] "RT @POTUS: Lets be clear about what changes Republicans in Congress want to make to Medicare and Social Security.\n\nThey want to raise th"
## [6] "RT @katie_honan: Queens safer than Nassau, would you look at that"
What we are going to do is use words as features in a predictive model. We won’t do anything fancy here, like splitting into a train and test set or cross-validating the results, but such would likely improve the model. Let’s add the scores back onto our main data and then clean the tweets into usable format. For easily replicability, we will fist collect all our cleaning tasks into one function, the output of which will be a tidy document term matrix:
clean_text <- function(text_vector){
text_vector %>%
gsub('\\n','\\\\n',.) %>%
gsub("(www|http:|https:)([^(\\s)]*)","",.) %>%
gsub('\\\\n','',.) %>%
gsub("[[:punct:] ]+",' ',.) %>%
gsub('[[:digit:]]+', '', .) %>%
gsub('[^\x01-\x7F]', '', .) %>%
tolower() -> text_vector
stopwords <- c("amp",stopwords("English"))
sw2 <- stopwords
sw2 <- gsub("'"," ",sw2)
stopwords <- unique(stopwords,sw2)
for(i in stopwords){
text_vector <- gsub(paste0("\\b",i,"\\b"), "", text_vector)
}
corpus <- VCorpus(VectorSource(text_vector))
dt <- DocumentTermMatrix(corpus)
tidy_dt <- tidy(dt)
tidy_dt$term <- stemDocument(tidy_dt$term)
tidy_dt %>%
dplyr::group_by(document,term) %>%
dplyr::summarise(count = sum(count, na.rm=T)) -> tidy_dt
unique_terms <- unique(tidy_dt$term)
sdm <- stringdistmatrix(unique_terms, unique_terms, method = "jw", useNames = TRUE)
diag(sdm) <- NA
inds <- which(apply(sdm,2,min, na.rm=T) <= 0.05)
min_dist <- apply(sdm,2,min,na.rm=T)
thresh <- min_dist < 0.05
sub <- sdm[thresh,thresh]
old_terms <- colnames(sub)
out <- list()
for(i in old_terms){
out[[i]] <- rownames(sub)[which.min(sub[,i])]
}
dict <- data.frame(term1 = old_terms,
term2 = unlist(out))
dict$first_term <- nchar(as.character(dict$term1)) < nchar(as.character(dict$term2))
dict$replacement <- ifelse(dict$first_term,
as.character(dict$term1),
as.character(dict$term2))
dict$term1 <- as.character(dict$term1)
dict$term2 <- as.character(dict$term2)
dict$replacement <- as.character(dict$replacement)
for(i in 1:nrow(tidy_dt)){
term <- tidy_dt[i,"term"]
if(term %in% dict$term1){
tidy_dt$term[i] <- dict$replacement[match(term,dict$term1)]
}
}
tidy_dt %>%
dplyr::group_by(document,term) %>%
dplyr::summarise(count = sum(count)) %>%
arrange(as.numeric(as.character(document))) -> tidy_dt
tidy_dt
}
Let’s clean the text and get it ready to go, focusing only on those observations for which we have a positive or negative sentiment determined. Details put to the side, to fit the model we are going to use what is called ridge regression to estimate the association between the terms in these tweets and the our positive/negative association score.
set.seed(1234)
inds <- sample(1:nrow(AOC),100)
training <- AOC[inds,]
training$score <- coded$score
training %>%
filter(score == 1 | score == -1) -> training
tidy_dt <- clean_text(training$data.text)
colnames(tidy_dt)[1] <- "Document"
tidy_dt$count <- as.numeric(tidy_dt$count)
tidy_dt %>%
pivot_wider(id_cols = Document,
names_from = term,
values_from = count,
values_fill = 0) -> wide
wide$score <- training$score
library(glmnet)
X <- as.matrix(wide[,-c(1,ncol(wide))])
y <- wide$score
lambdas <- 10^seq(2, -3, by = -.1)
cv_ridge <- cv.glmnet(X, y, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
ridge_reg <- glmnet(X, y, alpha = 0, family = 'gaussian', lambda = optimal_lambda)
Here is the main result – essentially a data-driven dictionary!
custom_dict <- data.frame(term = rownames(ridge_reg$beta),
association = ridge_reg$beta[,1])
rownames(custom_dict) <- NULL
head(custom_dict,10)
## term association
## 1 caucus -0.01895338
## 2 life -0.03790213
## 3 member -0.03148998
## 4 one -0.01719916
## 5 prestigio -0.03789975
## 6 repaoc -0.03723530
## 7 republican -0.02141854
## 8 reward -0.03789196
## 9 threaten -0.03788965
## 10 anyon 0.07088130
Let’s look at the most negative terms:
custom_dict %>%
arrange(association) %>%
head()
## term association
## 1 katiemcfaddenni -0.05412858
## 2 nooooo -0.05412755
## 3 govern -0.05320666
## 4 sir -0.05320317
## 5 incorrect -0.04733786
## 6 elonmusk -0.04733737
And now the most positive:
custom_dict %>%
arrange(desc(association)) %>%
head()
## term association
## 1 goooooo 0.08279192
## 2 sunday 0.07975051
## 3 happi 0.07974974
## 4 zoandbehold 0.07789266
## 5 capitol 0.07789195
## 6 boldprogress 0.07479448
To make this useful for the rest of the data, we need to first clean it as before.
AOC_clean <- clean_text(AOC$data.text)
## `summarise()` has grouped output by 'document'. You can override using the
## `.groups` argument.
## `summarise()` has grouped output by 'document'. You can override using the
## `.groups` argument.
Now let’s merge our dictionary onto this
AOC_clean <- merge(AOC_clean,custom_dict,by="term",all.x=T)
summary(AOC_clean)
## term document count association
## Length:13247 Length:13247 Min. :1.000 Min. :-0.054
## Class :character Class :character 1st Qu.:1.000 1st Qu.:-0.030
## Mode :character Mode :character Median :1.000 Median :-0.017
## Mean :1.044 Mean :-0.005
## 3rd Qu.:1.000 3rd Qu.: 0.013
## Max. :5.000 Max. : 0.083
## NA's :6854
Note that there are a lot of terms for which we have no association! This is because the terms did not appear in our training dataset and so we were unable to learn their association with tweets of a particular sentiment; an unsurprsing state of affairs given our small training size and the limited vocabulary therein. Let’s code these as zero’s for the time being and aggregate the scores to the tweet level.
AOC_clean$association <- ifelse(is.na(AOC_clean$association),0,AOC_clean$association)
AOC_clean %>%
group_by(document) %>%
dplyr::summarise(custom_sentiment = sum(count * association)) -> custom_scores
AOC$document_fix <- 1:nrow(AOC)
AOC <- merge(AOC,custom_scores,by.x="document_fix",by.y="document",all.x=T)
summary(AOC$custom_sentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.54276 -0.09858 -0.03108 -0.03705 0.02567 0.74462 13
Let’s see how we did! Let’s start by looking at the negative Tweets
AOC %>%
arrange(custom_sentiment) %>%
select(data.text) %>%
head()
## data.text
## 1 Tyre Nichols should be alive. Charges alone arent justice. Change is.\n\nAt least 1,176 people were killed by law enforcement last year - a record. Billions in trainings, body cams, and reforms havent stopped it. In fact, its gotten worse. We must grow out of this cycle together
## 2 https://t.co/TCSPMIaOlb \n\nAfter leading the party to a catastrophic ballot measure loss that wouldve saved Dem House seats, the party chair (Jacobs) compared a Black woman Dem nominee to the KKK. He was protected.\n\nLast nights underperformance is a consequence of that decision.
## 3 Trying to advance police reporting as the standard for belief denies the overwhelming reality of sexual assault, rape, &; child/domestic abuse &; strengthens a system that protects violators &; silences victims.\n\nIts also one exle why criminalization and justice arent synonyms.
## 4 Just a few months ago I literally had to explain to Republican members of Congress how periods work.\n\nTheir complete and utter incompetence is now killing women and pregnant people across the US.\n\nThere remains no legitimate grounding or basis to force birth in the United States. https://t.co/KWay0Enguy
## 5 Lastly, many moderate dems + leaders made it very clear that our help was not welcome nor wanted. Despite our many, many offers. Yet found ways to try to help from afar. So for them to blame us for respecting their approach in their districts is laughable. \n\nTake some ownership.
## 6 It is appalling how so many take advantage of headlines re: crime for an obsolete tough on crime political, media, &; budgetary gain, but when a public murder happens that reinforces existing power structures, those same forces rush to exonerate&;look the other way. We shouldnt.
Now the most positive:
AOC %>%
arrange(desc(custom_sentiment)) %>%
select(data.text) %>%
head()
## data.text
## 1 Thank you for calling attention to my pet project of making New York State a global leader in combating climate change and creating tons of good, high-paying jobs for people in the process. Im proud of it! \n\nWhats your pet project? Being a hater for a living? https://t.co/VkpBLQyNdu
## 2 The last time we stood with @Teamsters Local 202, we stared down a national food crisis over resistance to a $1 raise.\n\nBack then, railroad workers stood with us. They turned trains around to not cross a picket line.\n\nWe won then, and we can win now. \nLets get these sick days https://t.co/YAUFIKRawD
## 3 Shout out to Massachusetts and the people of Marthas Vineyard for showing the world what the best of America looks like \n\nIts unsurprising that they also send some of the best to Congress, like @ewarren, @EdMarkey, @AyannaPressley, @RepMcGovern, @RepKClark and so many more!
## 4 @Welcome2theBX @KathyHochul Thank you! We are securing similar pedestrian investments for Westchester Square on the Bronx side as well!
## 5 RT @MattGertz: Great news everyone, "America's Crime Crisis" (trademark Fox News) is over!\n\nAll it took was Election Day eliminating the pe
## 6 RT @therecount: Rep. @AOC (D-NY), speaking to a rowdy crowd, praises the Stand Up To Violence program for reducing Bronx crime:\n\nWe reduce
Pretty cool, but a lot of improvements could be made. For example, by increasing the size of the training set we would get better estimates of the association between words and our subjective positive/negative sentiment assessment of the tweet as a whole. We might use a more sophisticated methodology for predicting the scores, be more selective of the terms we include in the model, etc etc.
Topic models are a natural language processing technique in which a statistical model is used to discover abstract “topics” that occur in a collection of documents. Similar to our exploration of PCA, we would expect documents that speak to a particular topic to use more similar words than those that don’t. A classic example is that we would expect the terms “dog” and “bone” to occur in documents about dogs more than “cat” and “meow,” which are more likely to occur in documents about cats. Unlike PCA, documents will be given probabilities of assignment into a variety of topics which allows us to reflect the notion that a single document might contain multiple topics.
Putting details to the side, let’s take a look at one of the most common varieties of topic model: Latent Dirichlet allocation. Let’s start with a simple example using the AOC data where we fit only three topics to our text. We start by cleaning as usual, if in a somewhat hackish way.
library(topicmodels)
AOC_clean <- clean_text(AOC$data.text)
AOC_clean %>%
group_by(document) %>%
dplyr::summarise(clean_text = paste(term, collapse = " ")) -> AOC_reduced
corp <- VCorpus(VectorSource(AOC_reduced$clean_text))
dt <- DocumentTermMatrix(corp)
m1 <- LDA(dt, k = 3,
control = list(seed = 1234))
Fitting the model is the easy part! The rest of the analysis involves exploring and interpreting the model to determine which topics were found and if they are meaningful.
To get started, lets’ take a look at the per-topic-per-word probabilities estimated by the model.
tidy(m1, matrix = "beta")
## # A tibble: 11,610 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 aaronnarraph 0.0000873
## 2 2 aaronnarraph 0.00000121
## 3 3 aaronnarraph 0.000139
## 4 1 abandon 0.000238
## 5 2 abandon 0.000270
## 6 3 abandon 0.000173
## 7 1 abbott 0.0000721
## 8 2 abbott 0.000147
## 9 3 abbott 0.00000755
## 10 1 abc 0.000119
## # … with 11,600 more rows
Note that each term has three rows, each corresponding to one of the estimated topics. The beta column is the estimated probability from being generated from that particular topic. We can visualize the top words included in topics by doing the following:
tidy(m1, matrix = "beta") %>%
group_by(topic) %>%
slice_max(beta,n=10) %>%
ungroup() %>%
arrange(topic,-beta) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(x=beta, y= term, fill=factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") +
scale_y_reordered()
Cool! The first topic latches on to things like herself, power and justice. The second category looks like it deals with republicans and voting. The last seems to refer to people and the house.
One of the most difficult preliminary tasks when fitting topic models is first determining the number of topics to fit. A good first-cut solution to this problem is to estimate a variety of models and then choose that model which provides the best fit to the data. For our purposes, we can use the log-likelihood values saved in the fitted objects to choose a model which has enough, but not too many, topics by calculating AIC values.
First we fit the models, using a bit of parallel processing to speed things up…
library(pbapply)
library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl,"dt")
clusterEvalQ(cl,library(topicmodels))
## [[1]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[2]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[3]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[4]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[5]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[6]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[7]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
##
## [[8]]
## [1] "topicmodels" "stats" "graphics" "grDevices" "utils"
## [6] "datasets" "methods" "base"
mods <- pblapply(1:20, function(x)LDA(dt, k = x + 1,
control = list(seed = 1234)), cl=cl)
stopCluster(cl)
To evaluate which model to use, we will use the perplexity of the model. Discussing the mathematical details of this metric is outside the realm of our discussion, but the metric relates to how well a model predicts the data.
perplex <- sapply(mods,function(x)perplexity(x))
plot(perplex)
We see a HUGE dip around 11 topics, so let’s take a look at that!
The results of the model are the following:
opt_mod <- mods[[10]]
tidy(opt_mod, matrix = "beta") %>%
group_by(topic) %>%
slice_max(beta,n=10) %>%
ungroup() %>%
arrange(topic,-beta) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(x=beta, y= term, fill=factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") +
scale_y_reordered()
That’s not bad! We could likely improve performance further by doing additional cleaning prior to estimating the topic models, perhaps using a custom dictionary to remove non-political words. We might also use a different criteria for selecting the number of topics, increase the number of Tweets analyzed, etc.
But now that we have a decent topic model, we might want to go ahead and associate Tweets with their relevant topics. This will help us go further to determine which texts are associated with which topics, and will likely help us further determine their meaning. To do this, we need to extract another parameter from the fitted model:
tidy(opt_mod, matrix = "gamma")
## # A tibble: 10,692 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1 1 0.00233
## 2 2 1 0.00271
## 3 3 1 0.0913
## 4 4 1 0.00157
## 5 5 1 0.0245
## 6 6 1 0.00143
## 7 7 1 0.00132
## 8 8 1 0.00137
## 9 9 1 0.00251
## 10 10 1 0.00251
## # … with 10,682 more rows
Each of these values is an estimated proportion of words from that document that are generated from that topic. Many of these appear to be quite narrow in scope, which makes sense given the nature of Tweets, but a few look like they are a mix of topics. Let’s take a look at a few Tweets that rank highly on each topic. Here are the top 5 for topic 2:
tidy(opt_mod, matrix = "gamma") %>%
filter(topic == 2) %>%
slice_max(gamma, n = 5) %>%
dplyr::select(document) %>%
as.vector() %>%
unlist() %>%
as.numeric() %>%
AOC_reduced[.,"document"] %>%
as.vector() %>%
unlist() %>%
as.character() %>%
as.numeric() %>%
AOC[.,"data.text"]
## [1] "to confront the Courts structure (and core gerontocracy problem of lifetime appointments) via public appeal. While he did not succeed, that check came from the ppl &; Congress, NOT scotus.\n\nThe ruling is Roe, but the crisis is democracy. Leaders must share specific plans for both"
## [2] "We cannot allow Supreme Court nominees lying and/or misleading the Senate under oath to go unanswered.\n\nBoth GOP &; Dem Senators stated SCOTUS justices misled them. This cannot be accepted as precedent.\n\nDoing so erodes rule of law, delegitimizes the court, and imperils democracy. https://t.co/yZW6BKnqFG"
## [3] "Republicans love to ask what would happen if the right-wing harassed someone in restaurant?? as if they havent been doing that since Day 1.\n\nThese are their own tweets from 2019!\n\nSo the answer of what happens when its a Dem is: nothing. 0 sympathy for hypocritical whiners. https://t.co/p9ECQUUS71 https://t.co/QZwdAr5tuz"
## [4] "RT @SawyerHackett: Since taking office, Kyrsten Sinema has voted twice for measures with filibuster carve-outsto raise the debt ceiling an"
## [5] "RT @nowthisnews: 'I believe lying under oath is an impeachable offense' Days after SCOTUS overturned Roe v. Wade, AOC said justices must"
What about topic 9?
tidy(opt_mod, matrix = "gamma") %>%
filter(topic == 9) %>%
slice_max(gamma, n = 5) %>%
dplyr::select(document) %>%
as.vector() %>%
unlist() %>%
as.numeric() %>%
AOC_reduced[.,"document"] %>%
as.vector() %>%
unlist() %>%
as.character() %>%
as.numeric() %>%
AOC[.,"data.text"]
## [1] "The President &; Dem leaders can no longer get away with familiar tactics of committees and studies to avoid tackling our crises head-on anymore:\n\n- Restrain judicial review\n- Open clinics on federal lands\n- Court expansion\n- Expand Fed access/awareness of pill abortions\n- etc"
## [2] "Jokes aside, this is setting the stage for major potential harm when a natural disaster hits and no one knows what agencies, reporters, or outlets are real.\n\nNot long ago we had major flash floods. We had to mobilize trusted info fast to save lives. Today just made that harder https://t.co/bAG8ayBTa6"
## [3] "- Restrain judicial review\n- Expand the court\n- Clinics on federal lands\n- Expand education and access to Plan C\n- Repeal Hyde\n- Hold floor votes codifying Griswold, Obergefell, Lawrence, Loving, etc\n- Vote on Escobars bill protecting clinics\n\nWe can do it!\nWe can at least TRY"
## [4] "Actually thats not quite true! Hyde amdt prevents Fed gov from financing a clinic itself, but does NOT necessarily prevent the Fed gov from leasing land to an independent clinic.\n\nThis is why we shouldnt dismiss ideas out of hand, but try to thoroughly explore + investigate. https://t.co/XldXvwqP8E"
## [5] "@HacknerTyler @ryangrim Perfectly reasonable. This is the best we could do w/ the hand we were dealt - WH sprung this on us &; we had a window of <24h to secure sick leave when they had the votes locked to pass w/ no changes. Tanking wasnt an option bc of GOP votes, we moved to keep sick leave alive"
Not bad! Some improvements could obviously be made to the processing of the text to get more politically relevant results, but as an exploratory tool for answering “what topics might exist, and what terms are associated with them?” unsupervised methods like topic modeling are hard to beat!
As promised, here is some code for accessing the Twitter volume stream.
library(curl)
stream_tweets_v2 <- function(bearer_token, max_seconds, max_tweets){
# Set up endpoint and connection
h <- new_handle()
handle_setheaders(h,"Authorization" = paste0("Bearer ",bearer_token))
endpoint <- "https://api.twitter.com/2/tweets/sample/stream"
con <- curl(endpoint, handle = h)
# Set up termination conditions
max_time <- Sys.time() + max_seconds
if(max_tweets <= 100){
requests <- max_tweets
}else{
requests <- c(rep(100,floor(max_tweets/100)), max_tweets %% 100)
}
max_iter <- length(requests)
now <- Sys.time()
iter <- 1
# Collect Tweets
dumps <- list()
open(con)
while( (max_time > now) & (max_iter >= iter) ){
dumps[[iter]] <- readLines(con,n=requests[iter])
iter <- iter + 1
now <- Sys.time()
}
# Close connection and return
close(con)
all <- unlist(dumps)
out <- stream_in(textConnection(all),flatten = T)
out
}
# Test output (requires bearer token!)
wd <- "D:/Twitter"
setwd(wd)
twitter_info <- read.csv("twitter_info2.csv",
stringsAsFactors = F)
test <- stream_tweets_v2(twitter_info$bearer_token,10,1000)
as_tibble(test)
Imaging what you would have to go through to process that!
Before we get ahead of ourselves, we want to make sure that you have fundamentals in order. Do the following:
Write a script which…
Save and submit your working R script to the Exercise/Quiz Submission Link by the end of the day (ideally, end of lab session!).