Today we are going to continue our discussion of Twitter by considering how we might begin analyzing Tweets in a more sophisticated manner. To do so we need to learn some fundamentals of cleaning text within R.
To clean text we first need some text. For our purposes it will be sufficient to grab the past 500 Tweets from, say, Nancy Pelosi.
library(dplyr)
library(plyr) # plyr and dplyr share some function names (e.g., summarise), so we call dplyr::summarise explicitly where needed below
library(stringr)
library(tidyr)
library(tm)
library(tidytext)
library(stringdist)
library(ggplot2)
library(ggdendro)
library(plotly)
library(wordcloud)
wd <- "D:/Twitter"
setwd(wd)
twitter_info <- read.csv("twitter_info2.csv",
stringsAsFactors = F)
If you have access to the API, you can use the following to collect the Tweets we will be using (after loading the last_n_tweets function into your global environment).
tweets <- last_n_tweets(bearer_token = twitter_info$bearer_token,
user_id = "SpeakerPelosi",
n= 500,
tweet_fields = c("created_at","public_metrics"))
Lacking that access, you can follow along by reading the Tweets in with the following command.
tweets <- readRDS(url("https://www.dropbox.com/s/q3hi3cieylgsmyn/Pelosi.RDS?dl=1"))
head(tweets$data.text)
## [1] "Each generation, a courageous few have stepped forward to keep Americans safe and America secure. We owe these patriots, and their families, an unpayable debt of gratitude. Let us always strive to build a world worthy of their sacrifice. https://t.co/LICngt8efi"
## [2] "On #MemorialDay, Americans come together to pay tribute to the valiant servicemembers who gave their last full measure of devotion to defend our Democracy.\n\nToday, and every day, we hold their memories in our hearts, as well as the families and loved ones of our fallen heroes."
## [3] "Republicans are holding middle class families hostage to pass their extreme MAGA agenda and give tax cuts for the wealthy.\n\nA default on America's debt would eliminate jobs, increase housing costs and threaten retirement plans. https://t.co/WKTLQIad23"
## [4] "Today, I joined @DemWomenCaucus to speak about the Republicans’ vote against our veterans, our seniors, our families and our future.\n\nThis default on America is an assault on America's families and America's middle class. https://t.co/lx3aCyfruX"
## [5] ".@USJewishDems are a vital voice for justice, peace and Democracy, helping us pave the way to progress.\n\nProudly, this week I accepted their Defender of Democracy Award, and celebrated their work to bring about a more just society. https://t.co/UR3bjC6nu0"
## [6] "3 years ago, George Floyd was murdered.\n\nIn the wake of this brutal, racist killing, millions have risen up to demand we end the systemic racism & the police brutality that stain our nation.\n\nToday & every day, we must honor George’s memory by continuing the fight for justice."
One of the most basic things that we might want to do when facing a corpus of Tweets is determine whether or not they contain particular phrases or references to particular entities. For this, we can use the grep and grepl functions within R. For example, if we wanted to see what Tweets contain the term “Democrat” we can do the following:
grep("Democrat",
tweets$data.text,
ignore.case = T)
## [1] 10 12 16 26 27 28 30 34 35 62 70 80 92 96 99 101 105 107 111
## [20] 126 128 141 144 163 177 185 186 188 193 194 219 220 226 227 230 231 254 260
## [39] 261 262 288 298 299 301 308 330 333 334 339 344 347 350 367 370 372 375 376
## [58] 377 383 384 385 386 388 394 401 404 414 421 424 425 427 428 431 434 476 477
## [77] 482 483 484 485 487 488 489 492 493 495 497 500
This supplies a vector of index values referring to the Tweets which match the search criteria. If we use grepl instead, we get the following output:
grepl("Democrat",
tweets$data.text,
ignore.case = T)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [97] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [109] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
## [193] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [229] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [301] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [337] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [349] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
## [373] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [385] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [421] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE
## [433] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [457] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [481] FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
## [493] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
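Either form of output makes it easy to count how many Tweets match a given pattern, which will come in handy for the exercise at the end of this lab:
length(grep("Democrat", tweets$data.text, ignore.case = T)) # number of matching Tweets
sum(grepl("Democrat", tweets$data.text, ignore.case = T)) # same count, computed from the logical vector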
In my experience, grep is more useful for subsetting whereas grepl is more immediately useful for coding variables. For example, if we wanted to add a variable to our data which records whether or not a Tweet references “Democrats” we could do the following:
grepl("Democrat",
tweets$data.text,
ignore.case = T) -> tweets$democrat
Whereas if we wanted to subset down to those Tweets which contain the term “Democrat,” we could do the following:
grepl("Democrat",
tweets$data.text,
ignore.case = T) %>%
tweets[.,] -> tweets_dem
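For completeness, since grep returns the matching positions directly, the same subset can be produced by indexing with those positions (this should yield an identical tweets_dem):
tweets[grep("Democrat", tweets$data.text, ignore.case = T),] -> tweets_dem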
But, of course, there are multiple ways to accomplish the same task within R. For example, if we wanted to know which Tweets mention “Trump” or “Republicans,” we could do the following:
grepl("(Trump|Republican)",
tweets$data.text,
ignore.case = T) %>%
tweets[.,] -> tweets_rep
Great. Now let’s take a look at Tweets mentioning “Democrats.”
Let’s look at the first Tweet from this set, as an example:
tweets_dem$data.text[1]
## [1] "In the last Congress, @HouseDemocrats passed legislation to reinstate the Assault Weapons Ban and enact universal background checks. Republicans must join us now in enacting these critical measures to save lives and secure a safer future for all Americans. https://t.co/uuWNoYZ22m"
Regardless of the particular Tweet found above (and it was essentially random!), a few things are worth noting. First, there are a number of words which, while important for human readability, are so common as to be useless for text analysis. These terms are known as stop words and are generally removed before any analysis is conducted.
Second, the end of the Tweet contains a shortened t.co link. We will want to remove this as well. The same goes for punctuation and special characters which, for our purposes, don’t matter. Simply put, we want to take the above textual information and distill it into a form that is useful for analysis.
Let’s start by removing that link. To accomplish this, we will use the gsub function, which here replaces everything from “http” to the end of the Tweet with nothing.
tweets$data.text %>%
gsub("http.*","",.) -> tweets$tweets_clean
tweets$tweets_clean[1]
## [1] "Each generation, a courageous few have stepped forward to keep Americans safe and America secure. We owe these patriots, and their families, an unpayable debt of gratitude. Let us always strive to build a world worthy of their sacrifice. "
Sweet. Let’s get rid of punctuation next. Note that the pattern below replaces runs of punctuation and spaces with a single space, so apostrophes are removed too (e.g., "don't" becomes "don t"), something we will account for when removing stop words below:
tweets$tweets_clean %>%
gsub("[[:punct:] ]+",' ',.) -> tweets$tweets_clean
tweets$tweets_clean[1]
## [1] "Each generation a courageous few have stepped forward to keep Americans safe and America secure We owe these patriots and their families an unpayable debt of gratitude Let us always strive to build a world worthy of their sacrifice "
One will frequently come across Tweets which have a “new line” marker. For example, the following tweets contain this special character:
grep("\\n",tweets$tweets_clean)
## [1] 2 3 4 5 6 14 15 18 20 21 22 23 24 25 26 27 28 29
## [19] 30 31 32 33 34 35 36 38 39 40 41 42 43 45 46 47 48 50
## [37] 51 53 54 55 56 57 58 59 61 62 63 65 66 68 70 72 73 74
## [55] 75 76 77 78 79 82 84 86 89 90 91 92 94 95 96 97 98 99
## [73] 100 101 102 103 104 105 106 107 109 111 112 114 115 116 117 118 119 120
## [91] 121 122 123 125 126 129 130 131 132 133 134 135 136 137 138 139 140 141
## [109] 144 145 146 147 148 149 154 155 156 158 161 162 164 166 168 169 170 177
## [127] 178 180 181 184 186 189 191 192 193 194 195 196 198 199 200 201 202 203
## [145] 204 205 206 207 208 210 211 212 213 214 216 217 218 219 220 221 222 223
## [163] 226 227 228 229 233 236 239 240 241 242 245 246 247 248 255 256 258 261
## [181] 263 266 268 272 277 279 281 282 283 285 286 287 291 293 294 296 301 302
## [199] 304 307 311 312 313 314 320 321 322 323 324 328 329 332 333 334 335 336
## [217] 340 342 343 344 347 351 355 356 359 370 371 373 377 378 392 393 394 395
## [235] 403 404 405 406 408 410 412 413 414 415 416 417 418 422 423 424 425 433
## [253] 434 435 437 442 446 453 454 455 456 457 458 459 465 466 467 470 471 472
## [271] 475 476 478 479 480 481 482 483 484 485 486 487 488 489 490 492 493 494
## [289] 495 497
Let’s see what this looks like in the text:
tweets$tweets_clean[2]
## [1] "On MemorialDay Americans come together to pay tribute to the valiant servicemembers who gave their last full measure of devotion to defend our Democracy \n\nToday and every day we hold their memories in our hearts as well as the families and loved ones of our fallen heroes "
We can remove these with the following code (note the double backslash in the regex both above and below!):
tweets$tweets_clean %>%
gsub('\\n','',.) -> tweets$tweets_clean
tweets$tweets_clean[2]
## [1] "On MemorialDay Americans come together to pay tribute to the valiant servicemembers who gave their last full measure of devotion to defend our Democracy Today and every day we hold their memories in our hearts as well as the families and loved ones of our fallen heroes "
Stop words are a little bit harder to remove manually, as there are many of them. Luckily, common dictionaries exist. For example, from the tm package:
stopwords("English")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Now how do we go about removing these from the Tweets? We want to do this in as much of a “base R” way as possible, so let’s just use what we have learned above. First, note that everything in that list is lowercase, so let’s coerce our cleaned Tweets to be lowercase as well.
tweets$tweets_clean <- tolower(tweets$tweets_clean)
tweets$tweets_clean[1]
## [1] "each generation a courageous few have stepped forward to keep americans safe and america secure we owe these patriots and their families an unpayable debt of gratitude let us always strive to build a world worthy of their sacrifice "
To get a little bit of practice using R, we will do this in a somewhat silly way. In particular, let’s loop over each of these stop words and remove exact matches. Before doing this, however, we will duplicate the list to account for matches both with and without apostrophes, matching our punctuation filter from above.
stopwords <- c("amp",stopwords("English"))
sw2 <- stopwords
sw2 <- gsub("'"," ",sw2)
stopwords <- unique(c(stopwords,sw2)) # combine both versions and drop duplicates
for(i in stopwords){
tweets$tweets_clean <- gsub(paste0("\\b",i,"\\b"), "", tweets$tweets_clean)
}
tweets$tweets_clean[1]
## [1] " generation courageous stepped forward keep americans safe america secure owe patriots families unpayable debt gratitude let us always strive build world worthy sacrifice "
If you wanted to remove additional words, you could easily do so by expanding the “stopwords” vector created above. Before proceeding, let’s (1) remove numbers, (2) remove any remaining words with fewer than three characters and (3) strip out all non-ASCII characters (like emojis), just to see how such tasks look.
tweets$tweets_clean %>%
gsub('[[:digit:]]+', '', .) %>%
gsub('\\b\\w{1,2}\\b','',.) %>%
gsub('[^\x01-\x7F]', '', .) -> tweets$tweets_clean
Now we have something that is much more usable!
Now that we have some relatively clean text, we might want to compute summaries or visualize word frequencies. To do this, we need to form what is known as a document-term matrix where each row is a Tweet and each column is a term or word. For this we will use the tm package and create our first corpus:
corpus <- VCorpus(VectorSource(tweets$tweets_clean))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 500
To access information on the corpus elements, you can treat it like a list!
corpus[[1]]$content
## [1] " generation courageous stepped forward keep americans safe america secure owe patriots families unpayable debt gratitude let always strive build world worthy sacrifice "
Now let’s make the document term matrix:
dt <- DocumentTermMatrix(corpus)
tidy_dt <- tidy(dt)
tidy_dt
## # A tibble: 10,638 × 3
## document term count
## <chr> <chr> <dbl>
## 1 1 always 1
## 2 1 america 1
## 3 1 americans 1
## 4 1 build 1
## 5 1 courageous 1
## 6 1 debt 1
## 7 1 families 1
## 8 1 forward 1
## 9 1 generation 1
## 10 1 gratitude 1
## # … with 10,628 more rows
Something we might note at this juncture is that a number of words are virtually the same (like america and americans). One thing we might want to do at this point is reduce words to their “word stems,” essentially identifying the part of the word responsible for its lexical meaning and deleting modifiers. To do this, we can use the stemDocument function:
tidy_dt$term <- stemDocument(tidy_dt$term)
tidy_dt %>%
group_by(document,term) %>%
summarise(count = sum(count)) -> tidy_dt
## `summarise()` has grouped output by 'document'. You can override using the
## `.groups` argument.
tidy_dt
## # A tibble: 10,543 × 3
## # Groups: document [500]
## document term count
## <chr> <chr> <dbl>
## 1 1 alway 1
## 2 1 america 1
## 3 1 american 1
## 4 1 build 1
## 5 1 courag 1
## 6 1 debt 1
## 7 1 famili 1
## 8 1 forward 1
## 9 1 generat 1
## 10 1 gratitud 1
## # … with 10,533 more rows
We can see this had an effect, reducing the length of our dataframe from 10,638 rows to 10,543. Nonetheless, we still have america/american in the first document! Another approach to a similar task is to group words together using some notion of “string distance.” The stringdist package in R offers a number of options, and the stringdist-metrics help file gives an excellent summary of the available approaches.
For our purposes it will be sufficient to use the “Jaro” distance, which is essentially a weighted average of the rate of character matches between two strings, with a value of 0 indicating an exact match and a value of 1 indicating complete dissimilarity.
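As a quick sanity check before computing the full matrix: with method = "jw" and its default p = 0, stringdist returns the plain Jaro distance, which we can work out by hand for "america" versus "american":
m <- 7 # number of matching characters between "america" and "american"
trans <- 0 # number of transpositions among those matches
1 - (m/7 + m/8 + (m - trans)/m)/3 # Jaro distance, roughly 0.0417
stringdist("america", "american", method = "jw") # should agree with the value above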
unique_terms <- unique(tidy_dt$term)
sdm <- stringdistmatrix(unique_terms, unique_terms, method = "jw", useNames = TRUE)
print(sdm[1:5,1:5])
## alway america american build courag
## alway 0.0000000 0.55238095 0.45000000 1.0000000 0.5444444
## america 0.5523810 0.00000000 0.04166667 0.5523810 0.4603175
## american 0.4500000 0.04166667 0.00000000 0.5583333 0.4722222
## build 1.0000000 0.55238095 0.55833333 0.0000000 0.5444444
## courag 0.5444444 0.46031746 0.47222222 0.5444444 0.0000000
While somewhat unsurprising, “america” and “american” are a very small distance apart, whereas the other displayed terms are quite far from one another! Taking a heuristic peek at the data reveals that a distance of 0.05 appears to be a good threshold for combining terms. Let’s take a closer look.
# Set diagonal to NA since it is zero
diag(sdm) <- NA
# Find columns whose nearest neighbor is within the 0.05 threshold
inds <- which(apply(sdm,2,min, na.rm=T) <= 0.05)
# Convert to "distance" matrix, cluster, and plot
sdm %>%
.[inds,inds] %>%
as.dist(.) %>%
hclust(.) %>%
ggdendrogram(.) +
theme(axis.text.x = element_text(hjust = 1.25)) +
geom_hline(yintercept = 0.05)
That’s not bad, but there are a few errors. Things like “financi” and “financ” are combined as we would like, but so are things like “remind” and “remaind,” which we would rather keep separate. Still, for our purposes this looks like it is doing more or less what we want: the terms being grouped together in error are not exactly politically important, whereas the rest of the terms being combined are.
So how do we go about combining these terms within our text data? Let’s adopt the rule “if two terms match within our threshold, replace both with the shorter of the two.” We might do something like this to accomplish the task (although it isn’t elegant!):
min_dist <- apply(sdm,2,min,na.rm=T) # distance from each term to its nearest neighbor
thresh <- min_dist < 0.05 # terms with a close-enough match
sub <- sdm[thresh,thresh]
old_terms <- colnames(sub)
out <- list()
for(i in old_terms){
out[[i]] <- rownames(sub)[which.min(sub[,i])] # nearest neighbor of each term
}
dict <- data.frame(term1 = old_terms,
term2 = unlist(out))
dict$first_term <- nchar(dict$term1) < nchar(dict$term2) # TRUE if the original term is the shorter of the pair
dict$replacement <- ifelse(dict$first_term,
dict$term1,
dict$term2)
for(i in 1:nrow(tidy_dt)){
term <- tidy_dt[i,"term"]
if(term %in% dict$term1){
tidy_dt$term[i] <- dict$replacement[match(term,dict$term1)]
}
}
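If you prefer a vectorized approach, the loop above can be replaced with a couple of lines built around match; this sketch should produce the same replacements:
hits <- match(tidy_dt$term, dict$term1) # position of each term in the dictionary, NA if it has no close match
tidy_dt$term <- ifelse(is.na(hits), tidy_dt$term, dict$replacement[hits])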
tidy_dt %>%
group_by(document,term) %>%
summarise(count = sum(count)) %>%
arrange(as.numeric(as.character(document))) -> tidy_dt
## `summarise()` has grouped output by 'document'. You can override using the
## `.groups` argument.
tidy_dt
## # A tibble: 10,533 × 3
## # Groups: document [500]
## document term count
## <chr> <chr> <dbl>
## 1 1 alway 1
## 2 1 america 2
## 3 1 build 1
## 4 1 courag 1
## 5 1 debt 1
## 6 1 famili 1
## 7 1 forward 1
## 8 1 generat 1
## 9 1 gratitud 1
## 10 1 keep 1
## # … with 10,523 more rows
Yay! A more sophisticated analysis might approach this issue somewhat differently, but the above is a good example of how to approach cleaning your text data to ensure entries are appropriately grouped and informative.
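For reference, much of this cleaning can also be done with tm’s built-in transformations rather than by hand. A minimal sketch is below; note that it will not reproduce the string-distance merging we just performed:
corpus2 <- VCorpus(VectorSource(tweets$data.text))
corpus2 <- tm_map(corpus2, content_transformer(function(x) gsub("http.*", "", x))) # drop links
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, removeWords, stopwords("English"))
corpus2 <- tm_map(corpus2, stemDocument)
corpus2 <- tm_map(corpus2, stripWhitespace)
dt2 <- DocumentTermMatrix(corpus2)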
We can convert this to a “wide” version fairly easily, which is useful for more advanced analysis. As a bit of a sneak peek at what that sort of analysis entails, let’s create that wide data set…
colnames(tidy_dt)[1] <- "Document"
tidy_dt %>%
pivot_wider(id_cols = Document,
names_from = term,
values_from = count,
values_fill = 0) -> wide
wide
## # A tibble: 500 × 2,286
## # Groups: Document [500]
## Document alway america build courag debt famili forward generat gratitud
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 2 1 1 1 1 1 1 1
## 2 2 0 1 0 0 0 1 0 0 0
## 3 3 0 1 0 0 1 1 0 0 0
## 4 4 0 3 0 0 0 2 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0
## 6 6 0 0 0 0 0 0 0 0 0
## 7 7 0 0 0 0 0 1 0 0 0
## 8 8 0 0 0 0 0 0 0 0 0
## 9 9 0 0 0 0 0 0 0 0 0
## 10 10 0 1 0 0 0 0 0 0 0
## # … with 490 more rows, and 2,276 more variables: keep <dbl>, let <dbl>,
## # owe <dbl>, patriot <dbl>, sacrific <dbl>, safe <dbl>, secur <dbl>,
## # step <dbl>, strive <dbl>, unpay <dbl>, world <dbl>, worthi <dbl>,
## # come <dbl>, day <dbl>, defend <dbl>, democraci <dbl>, devot <dbl>,
## # everi <dbl>, fallen <dbl>, full <dbl>, gave <dbl>, heart <dbl>, hero <dbl>,
## # hold <dbl>, last <dbl>, love <dbl>, measur <dbl>, memori <dbl>,
## # memorialday <dbl>, one <dbl>, pay <dbl>, servicememb <dbl>, today <dbl>, …
In the above, each row is a document and each column is a term. Each cell indicates how many times a particular term appears in the document. As you can see, this matrix is quite sparse, since each Tweet contains only a handful of terms!
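If you want to put a number on just how sparse it is, one quick check is the share of zero cells in the matrix:
term_mat <- as.matrix(wide[,-1]) # drop the Document column
mean(term_mat == 0) # proportion of document-term cells equal to zero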
And now let’s do PCA on the data really quick. Note that we won’t go over what PCA is in depth here, as excellent tutorials are widely available. The important idea is that, when presented with a large amount of quantitative data, we can apply PCA to summarize the major axes of variation found within that data, effectively performing dimensionality reduction. Such techniques are a form of what is called unsupervised learning and underlie recommender systems in particular.
pca_hand <- function(data){
X <- t(scale(data)) # center and scale each column (term), then transpose to terms x documents
A <- X %*% t(X) # cross-product matrix across terms
E <- eigen(A) # eigendecomposition gives the principal directions
P <- t(E$vectors) # each row of P holds one component's loadings
new <- t(P %*% X) # project the documents onto the components (scores)
new
}
pca_wide <- pca_hand(wide[,2:ncol(wide)])
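As a sanity check on the by-hand implementation, the leading score columns should match (up to sign) what R’s built-in prcomp produces from the same centered and scaled data:
pca_check <- prcomp(wide[,2:ncol(wide)], center = TRUE, scale. = TRUE)
head(abs(pca_wide[,1:2]) - abs(pca_check$x[,1:2])) # differences should be essentially zero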
We can plot the first principal components against each other to see, heuristically, whether there are any clusters of Tweets or Tweets which “stand out” in some way:
plot(pca_wide[,1], pca_wide[,2])
Oof! There appears to be a cluster containing the majority of texts and a small number of extreme outliers on each dimension. There is one Tweet which really stands out on the first dimension. Let’s grab it:
which_tweet <- as.numeric(wide[which.max(abs(pca_wide[,1])),"Document"])
tweets[which_tweet,"data.text"]
## [1] "Hace un año, las vidas de 19 estudiantes y dos maestras fueron arrebatadas a sangre fría en una masacre en Robb Elementary School. Como país, seguimos rezando por las víctimas, los sobrevivientes, sus familias y las comunidades que cargan con el dolor de esta tragedia."
Well that would explain it – the Tweet is in Spanish, so after cleaning it shares very few terms with the rest of the (English-language) corpus! What about along the other dimension?
which_tweet <- as.numeric(wide[which.max(abs(pca_wide[,2])),"Document"])
tweets[which_tweet,"data.text"]
## [1] "Anna May Wong was a dazzling, trailblazing talent on the silver screen & a courageous advocate for representation in cinema — inspiring generations of AAPI actors.\n\nIt’s fitting her likeness will grace our quarter beside an all-American creed: E Pluribus Unum, “from many, one.” https://t.co/Cej7U45MSv"
This tweet is also fairly strange, being more about cinema than anything political.
These Tweets stand out from the rest by using a combination of words not found in many other Tweets. One thing worth noting is that, for exploratory purposes, conducting PCA on the entire document-term matrix is a bit slow. A way around this, which just so happens to let us practice some wrangling skills, is to remove columns which have only a few positive entries. This also effectively removes those weird Tweets identified above from the analysis.
hist(colSums(wide[,-1]),breaks=50)
For the purposes of our example, let’s only keep those terms which appear more than three times in the corpus.
sub <- wide[,c(1,(which(colSums(wide[,-1]) > 3)+1))]
pca_sub <- pca_hand(sub[,-1])
plot(pca_sub[,1],pca_sub[,2])
Now that looks a bit less extreme, but with some notable patterns. Let’s look at a few Tweets. First, the Tweet in the top left:
which_tweet <- as.numeric(unlist(sub[pca_sub[,1] < -5 & pca_sub[,2] > 5,"Document"]))
tweets[which_tweet,"data.text"]
## [1] "Today, at the First Parliamentary Summit of the International Crimea Platform, America and our allies sent an unmistakable statement to Putin: the free world is united in our unshakeable support for the people of Ukraine. https://t.co/oJF3FnUS2l"
## [2] "Today, it was my high honor to address the First Parliamentary Summit of @CrimeaPlatform, at the invitation of Speaker @R_Stefanchuk.\n\nMy message was simple: America and our allies pledged to stand with Ukraine until victory is won – and that is what we will do. https://t.co/VbPwNBXsf5"
## [3] "Join me in Zagreb, Croatia at the First Parliamentary Summit of @CrimeaPlatform to convey a statement of America’s fierce commitment to Ukraine’s fight for freedom.\n\nToday, we are affirming that we will be with the Ukrainian people until victory is won. https://t.co/bcvuGkw9bV"
## [4] "It’s my honor to represent the United States at the First Parliamentary Summit of the International @CrimeaPlatform.\n\nIn Zagreb, our European allies and global partners have gathered to send an unmistakable message: the free world is united in our unbreakable support for Ukraine. https://t.co/xtxM0TUjKW"
## [5] "It was an honor to meet with PM @AndrejPlenkovic and Speaker Gordan Jandroković. Croatia is a valued American ally and a key partner in peace and stability in Europe, including in energy, security and our global response to Russia’s aggression against Ukraine."
## [6] "Today, thanks to @ZelenskyyUa & @R_Stefanchuk, it was a privilege to attend the First Parliamentary Summit of the International @CrimeaPlatform in Croatia. It is a tribute to the broad & urgent global support for Ukraine that more than 50 nations are participating in this summit. https://t.co/agX8lGBacJ"
## [7] "Today, I met with Croatian Prime Minister @AndrejPlenkovic & Foreign Minister @GrlicRadman: top officials of a valued U.S. ally & key regional leader.\n\nWe discussed how our nations can continue advancing security & stability in Europe, especially through our support for Ukraine. https://t.co/I7MZimNea2"
## [8] "It was a privilege to meet with Speaker of the Croatian Parliament Gordan Jandroković, whose government is hosting the First Parliamentary Summit of the International Crimea Platform.\n\nIn our meeting, we reaffirmed our shared commitment to stand with Ukraine until victory is won. https://t.co/3Tru0fNwlT"
## [9] "Join Prime Minister @AndrejPlenkovic and me for a press conference in Zagreb, Croatia on our nations’ support for the Ukrainian people and our shared goals at the upcoming First Parliamentary Summit of the International @CrimeaPlatform. https://t.co/Hk66hTYhvt"
## [10] "Join Speaker Gordan Jandroković and me in Zagreb, Croatia for remarks on the important relationship between the U.S. and Croatia, the First Parliamentary Summit of the International Crimea Platform and our shared commitment to Ukraine. https://t.co/37Y3F8QNiA"
## [11] "As we will discuss this week, more must be done to defend democracy.\n\nThank you to Speaker Jandroković & Speaker @R_Stefanchuk, and Prime Minister @AndrejPlenkovic & President @ZelenskyyUa, for convening parliamentary leaders from the free world for this timely, important Summit."
## [12] "As Speaker, it’s my privilege to represent the United States at the First Parliamentary Summit of the International Crimea Platform.\n\nAlongside European allies and global partners, we will deliver an unmistakable statement of our solidarity with Ukraine in its fight for freedom."
All of these have to do with Croatia, Ukraine, and the First Parliamentary Summit of the International Crimea Platform. That’s a lot of the same words being used, hence the similar coordinates of these texts!
Let’s compare to the Tweet on the opposite side of that dimension (top right):
which_tweet <- as.numeric(unlist(sub[pca_sub[,1] > 5 & pca_sub[,2] > 5,"Document"]))
tweets[which_tweet,"data.text"]
## [1] "Republicans are holding middle class families hostage to pass their extreme MAGA agenda and give tax cuts for the wealthy.\n\nA default on America's debt would eliminate jobs, increase housing costs and threaten retirement plans. https://t.co/WKTLQIad23"
## [2] "It brings fairness to our tax code with the Child Tax Credit, while protecting Social Security, Medicare & Medicaid – in danger of being cut by Republicans.\n\nIt invests in America’s care economy: including child care, paid family & medical leave & housing. https://t.co/D7zOQMK1NG"
## [3] "Making them law will be a top priority for a Dem Majority next Congress. In stark contrast, MAGA Republicans’ extreme agenda would make inflation much worse: plotting to repeal lower prescription drug costs, give tax breaks to the ultra-rich & slash Social Security and Medicare."
## [4] "The #InflationReductionAct slashes the cost of prescription drugs, locks in lower health care premiums, cuts energy bills and reduces the deficit — while making the wealthiest few and big corporations pay their fair share. \n\nEvery single Republican voted No on this law."
## [5] "Democrats are creating jobs, cutting the deficit and making corporations pay their fair share.\n\nIn contrast, Republicans want to explode the deficit with massive tax cuts for the rich and big corporations, while slashing Social Security and Medicare. https://t.co/AhyRK4aTgI"
## [6] ".@HouseDemocrats are laser focused on widening the path to prosperity for America’s families: lowering costs and creating better-paying jobs.\n \nMeanwhile, @HouseGOP members are focused on their extreme MAGA agenda to slash Social Security and Medicare.\n \nThe contrast is clear."
## [7] "Republicans continue to show us their plans to cut Medicare and Social Security, take away a woman's right to choose and repeal lower prescription drug costs. \n\nMeanwhile, @HouseDemocrats have passed legislation to lower costs for families and take bold action on climate change. https://t.co/S7vKw5SlGB"
## [8] "With the #InflationReductionAct, @HouseDemocrats have taken on the special interests: lowering drug prices, slashing health insurance premiums and cutting energy bills For The People.\n\nEvery Member of @HouseGOP voted against it."
## [9] ".@HouseGOP continues to double down on a dangerous extreme MAGA agenda to criminalize women’s health care, slash seniors’ Medicare and attack our free and fair elections. \n \nWhile Republicans continue to push policies For Their Power, @HouseDemocrats are #ForThePeople."
## [10] "Social Security and Medicare are as important now for America’s seniors as ever.\n\nYet extreme MAGA Republicans are doubling down on their long-held goal of gutting these vital initiatives — plotting to slash benefits and raise costs for seniors."
## [11] "The average retiree benefit will increase by $146 per month next year, putting more money in American seniors’ pockets and protecting them from rising costs. Democrats will continue our work to protect Social Security and Medicare. #PeopleOverPolitics\nhttps://t.co/VLavLNFbKe"
## [12] "Republicans are pledging to repeal the lower drug costs and 158 House Republicans have already endorsed an extreme MAGA plan to privatize Social Security, raise the retirement age and end Medicare as we know it."
## [13] "Yet with Social Security & Medicare as essential as ever, extreme MAGA Republicans are openly plotting new schemes to slash seniors’ benefits & raise their costs – including by threatening to cause an economic catastrophe by holding the debt limit hostage for their toxic agenda."
## [14] "The increase in Social Security benefits, together with the historic action that Democrats have delivered to lower seniors’ prescription drug costs with the Inflation Reduction Act, shows the vital role Medicare and Social Security provide to protect seniors from rising costs."
These are all domestic in nature! So we might interpret the first principal component as picking up on an international-domestic dimension of her speech.
Now let’s look at dimension 2 by comparing Tweets at the top of the plot to those at the bottom. First, those at the top of the dimension (some of which we have seen before):
which_tweet <- as.numeric(unlist(sub[pca_sub[,2] > 8,"Document"]))
tweets[which_tweet,"data.text"]
## [1] "Making them law will be a top priority for a Dem Majority next Congress. In stark contrast, MAGA Republicans’ extreme agenda would make inflation much worse: plotting to repeal lower prescription drug costs, give tax breaks to the ultra-rich & slash Social Security and Medicare."
## [2] "It’s my honor to represent the United States at the First Parliamentary Summit of the International @CrimeaPlatform.\n\nIn Zagreb, our European allies and global partners have gathered to send an unmistakable message: the free world is united in our unbreakable support for Ukraine. https://t.co/xtxM0TUjKW"
## [3] "It was an honor to meet with PM @AndrejPlenkovic and Speaker Gordan Jandroković. Croatia is a valued American ally and a key partner in peace and stability in Europe, including in energy, security and our global response to Russia’s aggression against Ukraine."
## [4] "Today, I met with Croatian Prime Minister @AndrejPlenkovic & Foreign Minister @GrlicRadman: top officials of a valued U.S. ally & key regional leader.\n\nWe discussed how our nations can continue advancing security & stability in Europe, especially through our support for Ukraine. https://t.co/I7MZimNea2"
## [5] "It was a privilege to meet with Speaker of the Croatian Parliament Gordan Jandroković, whose government is hosting the First Parliamentary Summit of the International Crimea Platform.\n\nIn our meeting, we reaffirmed our shared commitment to stand with Ukraine until victory is won. https://t.co/3Tru0fNwlT"
## [6] "Join Prime Minister @AndrejPlenkovic and me for a press conference in Zagreb, Croatia on our nations’ support for the Ukrainian people and our shared goals at the upcoming First Parliamentary Summit of the International @CrimeaPlatform. https://t.co/Hk66hTYhvt"
## [7] "Join Speaker Gordan Jandroković and me in Zagreb, Croatia for remarks on the important relationship between the U.S. and Croatia, the First Parliamentary Summit of the International Crimea Platform and our shared commitment to Ukraine. https://t.co/37Y3F8QNiA"
## [8] "Democrats are creating jobs, cutting the deficit and making corporations pay their fair share.\n\nIn contrast, Republicans want to explode the deficit with massive tax cuts for the rich and big corporations, while slashing Social Security and Medicare. https://t.co/AhyRK4aTgI"
## [9] ".@HouseDemocrats are laser focused on widening the path to prosperity for America’s families: lowering costs and creating better-paying jobs.\n \nMeanwhile, @HouseGOP members are focused on their extreme MAGA agenda to slash Social Security and Medicare.\n \nThe contrast is clear."
## [10] "Republicans continue to show us their plans to cut Medicare and Social Security, take away a woman's right to choose and repeal lower prescription drug costs. \n\nMeanwhile, @HouseDemocrats have passed legislation to lower costs for families and take bold action on climate change. https://t.co/S7vKw5SlGB"
## [11] "Social Security and Medicare are as important now for America’s seniors as ever.\n\nYet extreme MAGA Republicans are doubling down on their long-held goal of gutting these vital initiatives — plotting to slash benefits and raise costs for seniors."
## [12] "Yet with Social Security & Medicare as essential as ever, extreme MAGA Republicans are openly plotting new schemes to slash seniors’ benefits & raise their costs – including by threatening to cause an economic catastrophe by holding the debt limit hostage for their toxic agenda."
## [13] "The increase in Social Security benefits, together with the historic action that Democrats have delivered to lower seniors’ prescription drug costs with the Inflation Reduction Act, shows the vital role Medicare and Social Security provide to protect seniors from rising costs."
On their own there isn’t anything that stands out as common across these texts. Let’s look at what is going on at the other end of the spectrum:
which_tweet <- as.numeric(unlist(sub[pca_sub[,2] < -6,"Document"]))
tweets[which_tweet,"data.text"]
## [1] "One year ago, 19 precious schoolchildren and 2 devoted educators were stolen in a cold-blooded massacre at Robb Elementary School. Americans continue to pray for the victims, survivors, their families and the community burdened with unspeakable grief from this senseless tragedy."
## [2] "In the wake of this massacre, we enacted our Bipartisan Safer Communities Act: a strong step to reduce gun violence. Now, we must reinstate the Assault Weapons Ban.\n \nDemocrats will never stop fighting against racism and gun violence — building a safer future for our children."
## [3] "One year ago today, America watched in horror as a hate-fueled attack stole ten beautiful souls in Buffalo. We continue to mourn the victims of this racist shooting and their families.\n \nIn the year since, the Buffalo community has shown extraordinary unity and resilience."
## [4] "Sickened by the mass shooting in Allen that terrorized families spending their afternoon at the mall.\n\nOur prayers are with the beautiful souls stolen, the victims wounded and their devastated loved ones, and we salute the heroic first responders. \n\nWe must stop the bloodshed."
## [5] "May it be a comfort to his loving wife Pamela, his dear children and step-children, his many beloved grandchildren, and his entire family that so many around the world mourn their loss and are praying for them during this sad time. \n\nhttps://t.co/VPfRXqzEXN"
## [6] "No one should live in fear of gun violence – in our schools, on the job or in our communities.\n\nAfter yet another deadly mass shooting, sadly still in the Holy Season, our prayers and our hearts are with the victims, their families and the Louisville community.\n\nEnough is enough."
## [7] "My prayers are with the families who have lost a loved one in the violent, horrific mass shooting at Covenant School today and with the entire Nashville community.\n \nThe assault weapons ban must be reinstated to protect our children.\n \nLet us pray! Let us act!"
## [8] "Her courage and persistence leave behind a legacy of progress and have inspired countless women in public service to follow in her footsteps.\n\nMay it be a comfort to the entire Schroeder family that so many mourn with and pray for them at this sad time. https://t.co/ZyhsFVRDp3"
## [9] "I am always moved by Pope Benedict’s encyclical, “God is Love,” where he quotes St. Augustine highlighting our duty as public servants to fight for justice.\n\nMay it be comfort to Pope Francis and the Vatican community that so many pray for Pope Benedict during this sad time."
## [10] "More must be done to save lives – which is why @HouseDemocrats have passed legislation to reinstate the Assault Weapons Ban, more efficiently warn communities during an active shooting and secure universal background checks. We will not rest until these vital measures become law."
## [11] "World AIDS Day is a solemn opportunity to reflect on the devastating toll that HIV/AIDS has wreaked on communities around the globe. \n\nToday, and every day, we remember the beautiful souls stolen away by this vicious disease and comfort the grieving loved ones left behind."
## [12] "Paul is grateful to the 911 operator, emergency responders, trauma care team, ICU staff, and the entire @ZSFGCare medical staff for their excellent and compassionate life-saving treatment he received after the violent assault in our home. https://t.co/nEY8ISANI8"
Gun violence and death everywhere! So we might call this second dimension something like “peaceful-violent.”
What sort of texts would appear in the empty quadrant around (-10,-10)? Based on the above, we might imagine it to be something like international violence. Let’s try this out by creating a PCA reconstruction of those coordinates.
coords <- c(-10,-10,rep(0,ncol(pca_sub)-2)) # target location: -10 on each of the first two components, 0 elsewhere
X <- t(scale(sub[,2:ncol(sub)])) # rebuild the scaled term matrix used for the PCA
A <- X %*% t(X)
E <- eigen(A)
P <- t(E$vectors) # the same component loadings as in pca_hand
val <- as.numeric((coords %*% P + colMeans(sub[,-1]))*apply(sub[,-1], 2, sd)) # map the point back toward term space
terms <- colnames(sub[,-1])
reconstruction <- data.frame(term = terms,
count = val)
Let’s look at the top words to be used in such a Tweet:
reconstruction %>%
arrange(desc(count)) %>%
head()
## term count
## 1 communiti 0.4673746
## 2 day 0.3501247
## 3 one 0.3254363
## 4 today 0.3114023
## 5 love 0.3044455
## 6 violenc 0.2538339
We get a combination of community, good wishes towards children, mentions of gun violence, honor, mourning and comfort, as well as a few terms related to the wider world (especially further down the list). If we look at the bottom of the list, we see things not to mention should we want to create a Tweet in that space:
reconstruction %>%
arrange(count) %>%
head()
## term count
## 1 cost -0.9704466
## 2 republican -0.7026375
## 3 lower -0.6334035
## 4 secur -0.4757435
## 5 social -0.4708345
## 6 medicar -0.4696433
Republicans, costs, social security – a whole litany of domestic policy and politics terms not related to international relations or gun violence. Pretty cool!
The final topic we will cover today deals with summarizing the corpus and visualizing the main themes using a word cloud.
One thing which might be of immediate interest is what words occur most and least frequently within this corpus. To see the most frequently used terms we can do the following:
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) %>%
arrange(desc(occurances))
## # A tibble: 2,285 × 2
## term occurances
## <chr> <dbl>
## 1 america 158
## 2 famili 103
## 3 today 103
## 4 work 100
## 5 will 91
## 6 year 79
## 7 fight 78
## 8 join 78
## 9 communiti 76
## 10 law 76
## # … with 2,275 more rows
To see the least frequently used terms we can likewise do
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) %>%
arrange(occurances)
## # A tibble: 2,285 × 2
## term occurances
## <chr> <dbl>
## 1 abduct 1
## 2 abject 1
## 3 abl 1
## 4 ablaz 1
## 5 abram 1
## 6 academ 1
## 7 acclaim 1
## 8 accord 1
## 9 actor 1
## 10 add 1
## # … with 2,275 more rows
A popular technique for qualitative assessment is to use a word cloud:
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) -> counts
wordcloud(counts$term,counts$occurances,max.words = 100,random.order = F)
And we can now put all of these things together for our subsets mentioning Democrats and Republicans from above! Let’s write a little function to do this for us:
tweet_cloud <- function(tweet_vec){
# clean the Tweets, remove stop words, and draw a word cloud
tweet_vec %>%
gsub("http.*","",.) %>%
gsub('[[:punct:] ]+',' ',.) %>%
gsub('\\n','',.) %>%
tolower() -> tweets_clean # lowercase so the (lowercase) stop words below are matched
stopwords <- c("amp",stopwords("English"))
for(i in stopwords){
tweets_clean <- gsub(paste0("\\b",i,"\\b"), "", tweets_clean)
}
corpus <- VCorpus(VectorSource(tweets_clean))
dt <- DocumentTermMatrix(corpus)
tidy_dt <- tidy(dt)
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) -> counts
wordcloud(counts$term,counts$occurances,max.words = 100,random.order = F)
}
Now let’s run it for the Tweets mentioning Democrats:
tweet_cloud(tweets_dem$data.text)
And for the Tweets mentioning Trump or Republicans:
tweet_cloud(tweets_rep$data.text)
There sure seems to be a qualitative difference! This will lead us into the domains of sentiment analysis and topic modeling in a few days!
Before we get ahead of ourselves, we want to make sure that you have the fundamentals in order. Do the following:
Write a script which…
Downloads the following Tweets (saved in RDS format) and reads them into R:
Search for three politically relevant keywords or combinations of keywords in this corpus. How many Tweets contain each?
Clean the corpus and create a document-term matrix. Cleaning should include removing punctuation, numbers and stopwords, stemming words, combining similar words, etc., as shown above.
Conduct PCA on the full document-term matrix and select some Tweets at the extremes. What do they say and how do they differ?
Subset your document-term matrix to somewhat more commonly used terms only. Conduct PCA and attempt to interpret the first two dimensions.
Create a wordcloud of the cleaned tweets for the politically relevant keywords you selected. Are there any qualitative differences between them?
Save and submit your working R script to the Exercise/Quiz Submission Link by the end of the day (ideally, end of lab session!).