Today we are going to continue our discussion of Twitter by considering how we might begin analyzing Tweets in a more sophisticated manner. To do so we need to learn some fundamentals of cleaning text within R.
To clean text we first need some text. Let’s begin by re-loading some functions from last time and collecting some Tweets to process!
library(httr)
library(jsonlite)
library(dplyr)
library(plyr)
library(stringr)
library(tidyr)
library(tm)
library(tidytext)
library(wordcloud)
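If any of these packages are missing on your machine, a one-time installation call along these lines (a sketch; trim it to whatever you actually need) will fetch them from CRAN:
# One-time installation of the packages loaded above (only run for packages you are missing)
install.packages(c("httr", "jsonlite", "dplyr", "plyr", "stringr",
                   "tidyr", "tm", "tidytext", "wordcloud"))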
wd <- "D:/Twitter"
setwd(wd)
twitter_info <- read.csv("twitter_info2.csv",
stringsAsFactors = F)
last_n_tweets <- function(bearer_token = "", user_id = "", n = 100,
                          tweet_fields = c("attachments",
                                           "created_at",
                                           "entities",
                                           "in_reply_to_user_id",
                                           "public_metrics",
                                           "referenced_tweets",
                                           "source")){
  headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
  # Convert the screen name into the account's numerical ID
  sprintf('https://api.twitter.com/2/users/by?usernames=%s', user_id) %>%
    httr::GET(url = .,
              httr::add_headers(.headers = headers),
              query = list()) %>%
    httr::content(., as = "text") %>%
    fromJSON(., flatten = T) %>%
    as.data.frame() -> tmp
  num_id <- tmp$data.id
  # For that user, grab the most recent n tweets, in batches of 100
  if(n <= 100){
    requests <- n
  }else{
    requests <- rep(100, floor(n/100))
    if(n %% 100 != 0){
      requests <- c(requests, n %% 100)
    }
  }
  next_token <- NA
  all <- list()
  # Initialize, grab first batch of results
  paste0('https://api.twitter.com/2/users/', num_id, '/tweets') %>%
    httr::GET(url = .,
              httr::add_headers(.headers = headers),
              query = list(`max_results` = requests[1],
                           tweet.fields = paste(tweet_fields, collapse = ","))) %>%
    httr::content(., as = "text") %>%
    fromJSON(., flatten = T) %>%
    as.data.frame() -> out
  all[[1]] <- out
  # For more than 100 tweets, page through the results using pagination tokens
  if(length(requests) >= 2){
    next_token[2] <- unique(as.character(all[[1]]$meta.next_token))
    for(i in 2:length(requests)){
      paste0('https://api.twitter.com/2/users/', num_id, '/tweets') %>%
        httr::GET(url = .,
                  httr::add_headers(.headers = headers),
                  query = list(`max_results` = requests[i],
                               tweet.fields = paste(tweet_fields, collapse = ","),
                               pagination_token = next_token[i])) %>%
        httr::content(., as = "text") %>%
        fromJSON(., flatten = T) %>%
        as.data.frame() -> out
      all[[i]] <- out
      next_token[i + 1] <- unique(as.character(all[[i]]$meta.next_token))
    }
  }
  # Stack all batches into a single data frame
  do.call("rbind.fill", all)
}
For our purposes it will be sufficient to grab the past 500 Tweets from, say, Nancy Pelosi.
tweets <- last_n_tweets(bearer_token = twitter_info$bearer_token,
                        user_id = "SpeakerPelosi",
                        n = 500,
                        tweet_fields = c("created_at","public_metrics"))
One of the most basic things that we might want to do when facing a corpus of Tweets is determine whether or not they contain particular phrases or references to particular entities. For this, we can use the grep and grepl functions within R. For example, if we wanted to see what Tweets contain the term “Democrat” we can do the following:
grep("Democrat",
tweets$data.text,
ignore.case = T)
## [1] 11 13 15 17 20 33 34 35 39 43 44 52 55 64 65 70 83 85
## [19] 91 95 107 110 120 135 136 137 142 146 147 156 157 159 195 208 212 216
## [37] 225 227 238 239 247 249 250 251 256 257 258 260 262 265 270 271 272 276
## [55] 279 282 283 285 302 303 304 305 306 307 308 311 319 328 330 331 332 333
## [73] 338 339 340 356 359 375 377 381 382 396 405 412 413 415 418 423 427 428
## [91] 429 431 455 456 458 459 461 463 464 465 480 488 492 497 499 500
This supplies a vector of index values referring to the Tweets which match the search criteria. If we use grepl instead, we get the following output:
grepl("Democrat",
tweets$data.text,
ignore.case = T)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [37] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [85] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [109] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [145] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [157] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
## [253] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## [265] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
## [277] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [337] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [421] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [433] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [457] FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [481] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [493] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
In my experience, grep is more useful for subsetting whereas grepl is more immediately useful for coding variables. For example, if we wanted to add a variable to our data which records whether or not a Tweet references “Democrats” we could do the following:
grepl("Democrat",
tweets$data.text,
ignore.case = T) -> tweets$democrat
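A quick sanity check on that new indicator (your counts will depend on when you pull the Tweets):
# How many of the 500 Tweets mention "Democrat"?
table(tweets$democrat)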
Whereas if we wanted to subset down to those Tweets which contain the term “Democrat,” we could do the following:
grepl("Democrat",
tweets$data.text,
ignore.case = T) %>%
tweets[.,] -> tweets_dem
But, of course, there are multiple ways to accomplish the same task within R. For example, if we wanted to know which Tweets mention "Trump" or "Republicans," we could do the following:
grepl("(Trump|Republican)",
tweets$data.text,
ignore.case = T) %>%
tweets[.,] -> tweets_rep
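As a side note, the same subset can be built from grep()'s integer indices, and grep() with value = TRUE returns the matching text itself; a quick sketch using the objects created above:
# Equivalent subset built from grep()'s row indices
rep_rows <- grep("(Trump|Republican)", tweets$data.text, ignore.case = TRUE)
tweets_rep2 <- tweets[rep_rows, ]
nrow(tweets_rep2) == nrow(tweets_rep)  # should be TRUE
# value = TRUE returns the matching Tweets themselves rather than their positions
head(grep("(Trump|Republican)", tweets$data.text, ignore.case = TRUE, value = TRUE), 2)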
Great. Now let's take a look at a random Tweet from our "Democrats" subset, as an example:
tweets_dem$data.text[sample(1:nrow(tweets_dem),1)]
## [1] "And with our Bipartisan Infrastructure Law, the Democratic Congress took action to advance environmental justice for all — from getting the lead out of our children’s drinking water to cleaning up legacy pollution disproportionately harming communities of color."
Whatever Tweet appears above (it was chosen at random!), a few things are worth noting. First, it contains a number of words which, while important for human readability, are so common as to be nearly useless for text analysis. These terms are known as stop words and are generally removed before any analysis is conducted. Second, many Tweets end with a shortened t.co link (to attached media, for instance). We will want to remove those as well. The same goes for punctuation and special characters which, for our purposes, do not matter.
Let's start by removing those links. To accomplish this, we will use the gsub function.
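If gsub is new to you, here is a tiny toy illustration (made-up strings, not our data): the pattern "http.*" matches "http" and everything after it, so replacing the match with "" deletes the link and anything that follows.
# Toy example: "http.*" matches "http" and everything after it
toy <- c("Great news today! https://t.co/abc123", "No link in this one")
gsub("http.*", "", toy)
# yields "Great news today! " and "No link in this one"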
tweets$data.text %>%
gsub("http.*","",.) -> tweets$tweets_clean
tweets$tweets_clean[1]
## [1] "Join the Asian Pacific Islander Council and me in San Francisco for a roundtable discussing community project funding, immigration, anti-AAPI hate efforts and equitable recovery for working families and essential workers. "
Sweet. Let’s get rid of punctuation next:
tweets$tweets_clean %>%
gsub('[[:punct:] ]+',' ',.) -> tweets$tweets_clean
tweets$tweets_clean[1]
## [1] "Join the Asian Pacific Islander Council and me in San Francisco for a roundtable discussing community project funding immigration anti AAPI hate efforts and equitable recovery for working families and essential workers "
Many Tweets also contain pesky newline markers (\n), which we can remove in a similar manner:
tweets$tweets_clean %>%
gsub('\\n','',.) -> tweets$tweets_clean
tweets$tweets_clean[1]
## [1] "Join the Asian Pacific Islander Council and me in San Francisco for a roundtable discussing community project funding immigration anti AAPI hate efforts and equitable recovery for working families and essential workers "
Stop words are a little bit harder to remove manually, as there are many of them. Luckily, common dictionaries exist.
stopwords("English")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
Now, how do we go about removing these from the Tweets? We want to do this in as much of a "base R" way as possible, so let's just use what we have learned above. First, note that everything in that list is lowercase, so let's coerce our cleaned Tweets to lowercase as well.
tweets$tweets_clean <- tolower(tweets$tweets_clean)
tweets$tweets_clean[1]
## [1] "join the asian pacific islander council and me in san francisco for a roundtable discussing community project funding immigration anti aapi hate efforts and equitable recovery for working families and essential workers "
To get a little more practice using R, we will do this in a somewhat silly way. In particular, let's loop over each of these stop words and remove exact matches:
# "amp" is added because "&" arrives HTML-escaped as "&amp;" and survives cleaning as "amp"
stopwords <- c("amp", stopwords("English"))
for(i in stopwords){
  # \\b marks word boundaries, so only whole-word matches are removed
  tweets$tweets_clean <- gsub(paste0("\\b", i, "\\b"), "", tweets$tweets_clean)
}
tweets$tweets_clean[1]
## [1] "join asian pacific islander council san francisco roundtable discussing community project funding immigration anti aapi hate efforts equitable recovery working families essential workers "
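For reference, the tm package bundles this kind of logic into removeWords(); a tiny toy demonstration (not applied to our data):
# Toy demonstration: tm's removeWords() does what our loop does in a single call
removeWords("we are fighting for the people", stopwords("English"))
# stop words are deleted, leaving extra whitespace behind (much like our loop above)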
If you wanted to remove additional words you could easily do so by expanding the “stopwords” vector created above. Alternatively, if you wanted to remove other things (like numbers) you can continue this process using regular expressions as demonstrated above.
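For instance, here is a sketch of two such optional steps, run on a copy of the cleaned text so that the examples below are unaffected:
# Optional extra cleaning (done on a copy; not used in the rest of this walkthrough)
extra_clean <- tweets$tweets_clean
extra_clean <- gsub("\\b[0-9]+\\b", " ", extra_clean)        # drop standalone numbers
extra_clean <- trimws(gsub("\\s+", " ", extra_clean))        # collapse runs of whitespace
extra_clean[1]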
Now that we have some relatively clean text, we might want to compute summaries or visualize word frequencies. To do this, we need to form what is known as a document-term matrix, where each row is a Tweet (a "document") and each column is a term (a word). For this we will use the tm package and create our first corpus:
corpus <- VCorpus(VectorSource(tweets$tweets_clean))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 500
To access information on the corpus elements, you can treat it like a list!
corpus[[1]]$content
## [1] "join asian pacific islander council san francisco roundtable discussing community project funding immigration anti aapi hate efforts equitable recovery working families essential workers "
Now let’s make the document term matrix:
dt <- DocumentTermMatrix(corpus)
tidy_dt <- tidy(dt)
tidy_dt
## # A tibble: 10,033 × 3
## document term count
## <chr> <chr> <dbl>
## 1 1 aapi 1
## 2 1 anti 1
## 3 1 asian 1
## 4 1 community 1
## 5 1 council 1
## 6 1 discussing 1
## 7 1 efforts 1
## 8 1 equitable 1
## 9 1 essential 1
## 10 1 families 1
## # … with 10,023 more rows
We can also convert this to a "wide" version fairly easily, which is useful for more advanced analyses. As a bit of a sneak peek at what that sort of analysis entails, let's create that wide data set…
tidy_dt %>%
spread(term,count,fill=0) -> wide
wide
## # A tibble: 500 × 2,991
## document ` national` ` 2019` ` achievements` ` artists` ` read` ` across`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 0 0 0
## 2 10 0 0 0 0 0 0
## 3 100 0 0 0 0 0 0
## 4 101 0 0 0 0 0 0
## 5 102 0 0 0 0 0 0
## 6 103 0 0 0 0 0 0
## 7 104 0 0 0 0 0 0
## 8 105 0 0 0 0 0 0
## 9 106 0 0 0 0 0 0
## 10 107 0 0 0 0 0 0
## # … with 490 more rows, and 2,984 more variables: ` america` <dbl>,
## # ` arriving` <dbl>, ` bipartisan` <dbl>, ` building` <dbl>, ` césar` <dbl>,
## # ` chávez` <dbl>, ` city` <dbl>, ` democrats` <dbl>, ` europe ` <dbl>,
## # ` generations` <dbl>, ` including` <dbl>, ` infrastructure` <dbl>,
## # ` january` <dbl>, ` jewish` <dbl>, ` joined` <dbl>, ` just` <dbl>,
## # ` last` <dbl>, ` let` <dbl>, ` many` <dbl>, ` may` <dbl>,
## # ` meanwhile` <dbl>, ` medium post` <dbl>, ` nearly` <dbl>, ` now` <dbl>, …
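As an aside, spread() still works but has been superseded in recent versions of tidyr by pivot_wider(); if you are on a newer tidyr, a sketch of the equivalent call is:
# Equivalent wide data frame with the newer tidyr verb (requires a reasonably recent tidyr)
wide2 <- pivot_wider(tidy_dt, names_from = term, values_from = count, values_fill = 0)
dim(wide2)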
And now let's do a quick PCA (principal component analysis) on the data. We won't cover what PCA is in depth here, since excellent tutorials are widely available. The important idea is that, when presented with a large amount of quantitative data, PCA summarizes the major axes of variation within that data, effectively performing dimensionality reduction. Techniques like this fall under the umbrella of unsupervised learning and underpin many recommender systems.
pca_hand <- function(data){
  X <- t(scale(data))    # center & scale each term, then transpose: terms in rows, Tweets in columns
  A <- X %*% t(X)        # cross-product of the standardized data, proportional to the correlation matrix
  E <- eigen(A)          # eigenvectors give the principal directions
  P <- t(E$vectors)
  new <- t(P %*% X)      # project each Tweet onto the principal directions (the "scores")
  new
}
pca_wide <- pca_hand(wide[,2:ncol(wide)])
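As an optional sanity check, base R's prcomp() should reproduce these scores up to sign; a quick sketch:
# Sanity check: principal-component scores from prcomp() should match ours up to sign
pr <- prcomp(wide[, 2:ncol(wide)], center = TRUE, scale. = TRUE)
cor(abs(pr$x[, 1]), abs(pca_wide[, 1]))  # should be essentially 1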
We can plot the first principal components against each other to see, heuristically, whether there are any clusters of Tweets or Tweets which “stand out” in some way:
plot(pca_wide[,1], pca_wide[,2])
There is one Tweet which really stands out on the first dimension. Let's grab it:
which_tweet <- as.numeric(wide[which.min(pca_wide[,1]),"document"])
tweets[which_tweet,"data.text"]
## [1] "Llevamos en el corazón a todos los que han perdido a un ser querido a causa de la violencia armada, ya que este horrible crimen agrava su sufrimiento. Estamos unidos en nuestra profunda gratitud por los héroes que arriesgaron sus vidas para responder a este mortal tiroteo masivo."
Ah! That would explain it: it's in a different language! (Roughly translated: "We carry in our hearts all those who have lost a loved one to gun violence, as this horrible crime compounds their suffering…") What about the Tweet which is very far off along dimension 2?
which_tweet <- as.numeric(wide[which.max(pca_wide[,2]),"document"])
tweets[which_tweet,"data.text"]
## [1] "Es hora de que todos los miembros del Congreso presten atención a la voluntad del pueblo estadounidense y se unan para promulgar la legislación bipartidista aprobada por la Cámara de Representantes la cual salvara vidas."
Same problem: this one is also in Spanish (roughly: "It is time for every Member of Congress to heed the will of the American people and come together to enact the life-saving bipartisan legislation passed by the House."). It is kind of neat that we were able to find such things. Were this a proper analysis, we could use this sort of anomaly detection to re-filter the data as needed before moving on to results. One thing worth noting is that, for exploratory purposes, conducting PCA on the entire document-term matrix is a bit slow. A way around this, which also happens to give us some wrangling practice, is to remove columns (terms) which have only a few positive entries. Because the terms in those unusual Tweets are rare, this also largely removes their influence on the analysis.
hist(colSums(wide[,-1]),breaks=50)
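(As an aside, the tm package also offers removeSparseTerms(), which drops rare terms directly from the document-term matrix; a sketch with an illustrative threshold:)
# Sketch: keep only terms appearing in roughly 1% or more of documents (threshold is illustrative)
dt_small <- removeSparseTerms(dt, sparse = 0.99)
dt_small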
For the purposes of our example, let's only keep those terms which occur more than three times.
sub <- wide[,c(1,(which(colSums(wide[,-1]) > 3)+1))]
pca_sub <- pca_hand(sub[,-1])
plot(pca_sub[,1],pca_sub[,2])
Now that looks a bit less extreme, but with some notable patterns. Let’s look at a few Tweets. First, the Tweet in the bottom left:
which_tweet <- as.numeric(sub[which.min(pca_sub[,2]),"document"])
tweets[which_tweet,"data.text"]
## [1] "Today, our delegation was honored to meet with @AndrzejDuda: a valued partner in supporting Ukraine in the face of Putin’s brutal war. \n\nWe expressed America’s gratitude to Poland for opening hearts & homes to refugees and reaffirmed our commitment to our nations’ partnership. https://t.co/Ij0VhuCBgl"
Let’s compare to the Tweet on the opposite side of that dimension:
which_tweet <- as.numeric(sub[which.max(pca_sub[,2]),"document"])
tweets[which_tweet,"data.text"]
## [1] "Every woman, everywhere, has the Constitutional right to basic reproductive health care. House Democrats will never relent in defending health freedoms across our nation – wherever and whenever they come under threat. Read my full statement here: https://t.co/Hm82aerU4d"
Quite different! Now let’s look at dimension 1.
which_tweet <- as.numeric(sub[which.min(pca_sub[,1]),"document"])
tweets[which_tweet,"data.text"]
## [1] "As Speaker of the House, it was my privilege to welcome His Excellency Kyriakos Mitsotakis, Prime Minister of the Hellenic Republic, to the United States Capitol this morning before a bilateral meeting and his address to the Joint Session of Congress. https://t.co/E1gItQ6kr3"
And on the opposite end:
which_tweet <- as.numeric(sub[which.max(pca_sub[,1]),"document"])
tweets[which_tweet,"data.text"]
## [1] "The March job report shows Democrats’ economic strategy continues to power a strong jobs recovery. Since @POTUS took office, our nation has created 7.9 million new jobs, slashed the unemployment rate to 3.6% & workers’ wages are rising with the help of our #AmericanRescuePlan."
Again, quite the substantive difference! Now let's see which Tweets score "close" together on the first dimension:
which_tweets <- as.numeric(unlist(sub[pca_sub[,1] > 12,"document"]))
tweets[which_tweets,"data.text"]
## [1] "With the help of our #AmericanRescuePlan, our nation slashed the unemployment rate to 3.6% – near pre-pandemic levels. To build on this progress, Democrats remain laser-focused on lowering costs for working families as they face Putin's Price Hike. https://t.co/ooCkZo8nMB"
## [2] "As last month's jobs report showed, our nation has built a strong recovery: creating 7.9 million new jobs since @POTUS took office.\n\nDemocrats are focusing on lowering costs, growing paychecks and creating jobs for families while ensuring the richest few pay their fair share."
## [3] "The March job report shows Democrats’ economic strategy continues to power a strong jobs recovery. Since @POTUS took office, our nation has created 7.9 million new jobs, slashed the unemployment rate to 3.6% & workers’ wages are rising with the help of our #AmericanRescuePlan."
## [4] "Democrats are focused on Building a Better America: making more goods in America, lowering the costs of energy, child care, prescription drugs and more, bolstering competition to deliver better prices for consumers, and creating more good-paying union jobs across the country. https://t.co/66qU8Qhkzo"
Pretty similar substantively! Hopefully it is apparent how such techniques can be used to cluster text in an unsupervised manner and form the basis for recommender systems which, given an input, recommend something similar.
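To make that last idea a bit more concrete, here is a sketch of a toy "recommender": it assumes the filtered matrix sub and the tweets data frame from above, picks an arbitrary target Tweet, and finds the Tweet whose term profile is closest by cosine similarity.
# Sketch of the "recommend something similar" idea: cosine similarity between term profiles
m <- as.matrix(sub[, -1])                       # Tweets (rows) by terms (columns)
target <- 1                                     # arbitrary Tweet to find a neighbour for
sims <- as.vector(m %*% m[target, ]) /
  (sqrt(rowSums(m^2)) * sqrt(sum(m[target, ]^2)))
sims[target] <- NA                              # ignore self-similarity
closest <- which.max(sims)
tweets[as.numeric(sub$document[c(target, closest)]), "data.text"]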
The final topic we will cover today deals with summarizing the corpus and visualizing the main themes using a word cloud.
One thing which might be of immediate interest is what words occur most and least frequently within this corpus. To see the most frequently used terms we can do the following:
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) %>%
arrange(desc(occurances))
## # A tibble: 2,990 × 2
## term occurances
## <chr> <dbl>
## 1 will 97
## 2 america 88
## 3 today 88
## 4 join 85
## 5 families 73
## 6 congress 71
## 7 nation 68
## 8 ukraine 68
## 9 house 66
## 10 americans 61
## # … with 2,980 more rows
To see the least frequently used terms, we can likewise do:
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) %>%
arrange(occurances)
## # A tibble: 2,990 × 2
## term occurances
## <chr> <dbl>
## 1 national 1
## 2 2019 1
## 3 achievements 1
## 4 artists 1
## 5 read 1
## 6 across 1
## 7 america 1
## 8 arriving 1
## 9 bipartisan 1
## 10 building 1
## # … with 2,980 more rows
Even if these tables are not particularly informative on their own, they reveal a few potential issues within our current corpus! For example, we might now think about how to stem words so that terms like "america," "american," and "americans" are not distinct entries. Likewise, we might address the leading/trailing whitespace visible in some terms, remove numbers, and so on. These are all decisions you'll have to make when analyzing text!
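As a taste of what stemming does, here is a sketch using tm's stemDocument() (which relies on the SnowballC package); we do not apply it to our corpus here:
# Porter stemming collapses related word forms onto a common stem (needs SnowballC installed)
stemDocument(c("american", "americans", "family", "families"))
# "americans" maps to the same stem as "american"; "families" and "family" share a stem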
But now that we have something that we might work with, we can start visualizing the results. A popular technique for qualitative assessment is to use a word cloud:
tidy_dt %>%
group_by(term) %>%
dplyr::summarise(occurances = sum(count)) -> counts
wordcloud(counts$term,counts$occurances,max.words = 100,random.order = F)
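If you prefer something more precise than a word cloud, a plain bar chart of the top terms works too; a quick sketch with base graphics:
# Bar chart of the 15 most frequent terms (an alternative view of the same counts)
counts %>%
  arrange(desc(occurances)) %>%
  head(15) -> top_terms
barplot(top_terms$occurances, names.arg = top_terms$term,
        las = 2, cex.names = 0.7, main = "Most frequent terms")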
And we can now put all of these pieces together for our subsets of Tweets mentioning Democrats and Republicans from above! Let's write a little function to do this for us.
tweet_cloud <- function(tweet_vec){
  # Clean: strip links, punctuation and newlines, then lowercase (so the stopword list matches)
  tweet_vec %>%
    gsub("http.*", "", .) %>%
    gsub('[[:punct:] ]+', ' ', .) %>%
    gsub('\\n', '', .) %>%
    tolower() -> tweets_clean
  # Remove stop words
  stopwords <- c("amp", stopwords("English"))
  for(i in stopwords){
    tweets_clean <- gsub(paste0("\\b", i, "\\b"), "", tweets_clean)
  }
  # Build the corpus and document-term matrix, then count terms and plot
  corpus <- VCorpus(VectorSource(tweets_clean))
  dt <- DocumentTermMatrix(corpus)
  tidy_dt <- tidy(dt)
  tidy_dt %>%
    group_by(term) %>%
    dplyr::summarise(occurances = sum(count)) -> counts
  wordcloud(counts$term, counts$occurances, max.words = 100, random.order = F)
}
Now let’s run it for the democratic tweets:
tweet_cloud(tweets_dem$data.text)
And the republican tweets:
tweet_cloud(tweets_rep$data.text)
There sure seems to be a qualitative difference! In a few days, this will lead us into the domains of sentiment analysis and topic modeling!
Before we get ahead of ourselves, we want to make sure that you have the fundamentals in order. Do the following:
Write a script which…
Save and submit your working R script to the Exercise/Quiz Submission Link by the end of the day (ideally, end of lab session!).