This script outlines the steps taken to create our immigration keyword dictionary. The keyword dictionary will be used to collect tweets discussing immigration in the three months leading up to the 2016 Presidential Election (i.e., August 8 to November 8, 2016).
1. First, I collected tweets with the keywords immigration, immigrant, and immigrants. Next, I randomly sampled 10,000 tweets for each month of interest.
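The per-month sampling step itself is not echoed here; a minimal sketch of one way to do it, assuming the raw pull sits in a data frame called immigration_tweets with a date column named Date (a hypothetical column name), is:
set.seed(123) # for reproducible sampling
immigration_tweets <- immigration_tweets %>%
  dplyr::mutate(month = format(as.Date(Date), "%Y-%m")) %>% # e.g., "2016-08"
  dplyr::group_by(month) %>%
  dplyr::slice_sample(n = 10000) %>%                        # 10,000 tweets per month
  dplyr::ungroup()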
text_token <- immigration_tweets$Contents[!duplicated(immigration_tweets$Contents)] #remove duplicates
text_token[1:5] #sample of the first 5 tweets
## [1] "RT @shezumi lazy immigrants, going to a country that's not theirs and taking things for free. https://t.co/P0ZUZvJJTt"
## [2] "PayPal cofounder Max Levchin praises White House proposal to attract immigrant entrepreneurs https://t.co/1pClubDsrk"
## [3] "RT @JaredWyand #ManyPeopleAreSaying Muslim immigration is a plot to make us choose between freedom & safety ending in control https://t.co/OqSMCo0U0W"
## [4] "Melania #Trump will address immigration controversy in next couple of weeks, D Trump says. 14 days & counting #GOP https://t.co/xEIk0xQl7N"
## [5] "RT @WeNeedTrump RETWEET if you want American citizens to come before refugees and illegal immigrants. https://t.co/LX87WxAbMY"
The tweets are read into R, but the text includes a lot of unnecessary elements. To clean the text, I removed URLs, the RT notation, numbers, emojis (and other non-ASCII characters), and extra white space. I also made all of the words lowercase. All hashtags and @mentions remain.
text_token <- text_token %>%
  stringr::str_replace_all("RT *", "") %>%                      # drop the RT notation
  stringr::str_replace_all("https://t.co/[a-zA-Z0-9]*", "") %>% # drop shortened t.co links
  stringr::str_replace_all("[[:digit:]]", "") %>%               # drop numbers
  iconv("latin1", "ASCII", sub = "") %>%                        # drop emojis/non-ASCII characters
  stringr::str_squish() %>%                                     # collapse extra white space
  tolower()                                                     # lowercase everything
text_token[1:5]
## [1] "@shezumi lazy immigrants, going to a country that's not theirs and taking things for free."
## [2] "paypal cofounder max levchin praises white house proposal to attract immigrant entrepreneurs"
## [3] "@jaredwyand #manypeoplearesaying muslim immigration is a plot to make us choose between freedom & safety ending in control"
## [4] "melania #trump will address immigration controversy in next couple of weeks, d trump says. days & counting #gop"
## [5] "@weneedtrump retweet if you want american citizens to come before refugees and illegal immigrants."
Now that the text is clean, I examined frequent words using quanteda.
token <- tokens_remove(tokens(text_token, remove_punct = TRUE, remove_twitter = FALSE), stopwords("english")) ### tokenize text (removing punctuation & stopwords, keeping hashtags and @mentions)
Here are the tokenized words.
token[1:5]
## tokens from 5 documents.
## text1 :
## [1] "@shezumi" "lazy" "immigrants" "going" "country"
## [6] "taking" "things" "free"
##
## text2 :
## [1] "paypal" "cofounder" "max" "levchin"
## [5] "praises" "white" "house" "proposal"
## [9] "attract" "immigrant" "entrepreneurs"
##
## text3 :
## [1] "@jaredwyand" "#manypeoplearesaying" "muslim"
## [4] "immigration" "plot" "make"
## [7] "us" "choose" "freedom"
## [10] "safety" "ending" "control"
##
## text4 :
## [1] "melania" "#trump" "will" "address" "immigration"
## [6] "controversy" "next" "couple" "weeks" "d"
## [11] "trump" "says" "days" "counting" "#gop"
##
## text5 :
## [1] "@weneedtrump" "retweet" "want" "american"
## [5] "citizens" "come" "refugees" "illegal"
## [9] "immigrants"
n_grams <- tokens_ngrams(token, n = 1:2, concatenator = " ") ### unigram and bigram frequency
ngram_dfm <- dfm(n_grams) ##document frequency matrix
Grams_imm <- topfeatures(ngram_dfm,300) #top 300 unigrams and bigrams
Here are the top 300 unigrams and bigrams. I used keywords associated with immigration to expand the keyword dictionary for another search. Added words included border, illegal immigrant, immigration policy, and immigration reform (a sketch of the expanded keyword vector follows the frequency table below).
Grams_imm
## immigration immigrants trump
## 9959 7260 4171
## illegal immigrant illegal immigrants
## 3223 2811 1510
## via #immigration will
## 1393 1162 1149
## donald us trump's
## 1083 1062 990
## illegal immigration @realdonaldtrump new
## 899 863 861
## muslim people donald trump
## 819 684 670
## hillary policy undocumented
## 653 648 646
## clinton u.s says
## 635 635 635
## america like plan
## 618 609 607
## just vote now
## 605 542 540
## speech get illegal immigrant
## 539 513 505
## americans american #trump
## 499 499 497
## undocumented immigrants legal can
## 494 491 477
## obama immigration policy border
## 466 446 443
## country trumps want
## 436 428 424
## jobs reform one
## 422 418 416
## stop law immigration plan
## 404 388 376
## laws wants taxes
## 374 364 360
## say trump's immigration wall
## 352 351 346
## need immigration speech immigration reform
## 341 334 315
## said campaign million
## 312 311 310
## white going refugees
## 310 310 309
## policies pay w
## 303 300 299
## back #maga many
## 299 299 297
## know election melania
## 296 296 293
## right make women
## 290 288 287
## support @hillaryclinton citizenship
## 286 279 273
## news first #tcot
## 272 272 269
## citizens video muslim immigration
## 269 266 260
## security time good
## 256 256 253
## muslims #immigrants vetting
## 252 251 248
## voters @foxnews court
## 247 247 245
## mass immigration laws issue
## 244 241 240
## work mexican mexico
## 240 238 237
## think economy help
## 234 234 233
## must children trump immigration
## 231 229 229
## great year go
## 226 225 225
## still every deport
## 224 222 222
## anti-immigrant crime borders
## 221 219 218
## come black muslim immigrants
## 217 216 215
## racist open status
## 213 210 208
## report never even
## 208 207 207
## ban years watch
## 206 204 203
## donald trump's president nation
## 200 196 194
## better keep care
## 192 192 191
## take debate see
## 191 190 190
## media families usa
## 183 183 182
## state federal immigration policies
## 182 182 181
## stance story deportation
## 181 181 180
## way world system
## 180 177 176
## killed u extreme
## 173 172 171
## trump says end workers
## 171 170 170
## test @fairimmigration national
## 169 168 168
## criminal terrorism really
## 168 168 167
## supreme problem also
## 167 167 165
## tax day bad
## 165 163 162
## much talk may
## 162 161 160
## give read refugee
## 160 160 160
## let times today
## 160 158 158
## hillary clinton gop supreme court
## 157 157 156
## issues illegals call
## 155 155 155
## #debatenight supporters deported
## 155 154 154
## big government bill
## 153 153 150
## breitbart history party
## 149 148 147
## detention two family
## 146 146 146
## voting shows poll
## 146 146 145
## tell position #debate
## 144 144 143
## coming immigration law @youtube
## 141 141 141
## change extreme vetting illegally
## 141 140 140
## race hispanic calls
## 138 138 138
## home man build
## 138 137 137
## melania trump made uk
## 137 136 136
## s trade countries
## 136 136 135
## another kaine working
## 135 135 134
## look since trumps immigration
## 134 132 132
## soros hate immigration via
## 132 132 130
## rights live pence
## 130 129 128
## r please fact
## 128 127 127
## real left washington
## 126 126 126
## needs plans love
## 126 126 126
## softening ever islamic
## 125 125 125
## nothing study got
## 124 124 124
## presidential dhs cnn
## 124 123 123
## last talking mass immigration
## 123 123 122
## due woman #news
## 122 121 120
## saying amnesty life
## 120 120 120
## case #trumppence yes
## 119 119 119
## post free show
## 118 117 117
## candidate enforcement private
## 117 117 116
## bring democrats lives
## 116 116 116
## trying isis stay
## 116 116 115
## billion used questions
## 115 115 115
## next job states
## 115 115 114
## arizona immigration status europe
## 113 113 113
## open borders latinos top
## 112 112 111
## obama's #hillary believe
## 111 110 110
## foreign become fox
## 110 109 109
## millions control york
## 109 109 109
## fraud b killed illegal
## 108 108 107
## thing donald trumps wrong
## 107 107 106
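For reference, the expanded keyword dictionary for the second search can be stored as a simple character vector. The terms below are the ones named above; the object name is hypothetical:
# Expanded keyword dictionary for the second collection
immigration_keywords <- c("immigration", "immigrant", "immigrants",
                          "border", "illegal immigrant",
                          "immigration policy", "immigration reform")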
2. With an expanded list of keywords, I collected another sample of 30,000 tweets (10,000/month).
Again, I cleaned the text, removing unwanted items (e.g., URLs, emojis, numbers).
text_token <- immigration_tweets2$Contents[!duplicated(immigration_tweets2$Contents)] #remove duplicates
text_token[1:5] #sample of the first 5 tweets
## [1] "Explosions rock Syrian border as Turkey presses operation https://t.co/Jh5kZWV0Hx"
## [2] "RT @irritatedwoman Hillary Clinton Drafting Illegal Immigrants To Boost Voter Turnout https://t.co/Hun5WxUgn1"
## [3] "learning some of my flaws as a competitor brings a sense of security"
## [4] "@JoeNBC Goldman Sachs the in card 4 Wall Street Evan McMullin think American_"
## [5] "RT @SportsCenter Simone Biles and Aly Raisman take gold and silver in Individual All-Around for #USA. https://t.co/vCHisn2KhM"
text_token <- text_token %>%
  stringr::str_replace_all("RT *", "") %>%                      # drop the RT notation
  stringr::str_replace_all("https://t.co/[a-zA-Z0-9]*", "") %>% # drop shortened t.co links
  stringr::str_replace_all("[[:digit:]]", "") %>%               # drop numbers
  iconv("latin1", "ASCII", sub = "") %>%                        # drop emojis/non-ASCII characters
  stringr::str_squish() %>%                                     # collapse extra white space
  tolower()                                                     # lowercase everything
text_token[1:5] #sample of the first 5 tweets
## [1] "explosions rock syrian border as turkey presses operation"
## [2] "@irritatedwoman hillary clinton drafting illegal immigrants to boost voter turnout"
## [3] "learning some of my flaws as a competitor brings a sense of security"
## [4] "@joenbc goldman sachs the in card wall street evan mcmullin think american_"
## [5] "@sportscenter simone biles and aly raisman take gold and silver in individual all-around for #usa."
Next, I converted my vector of tweets into a corpus and created a document-term matrix, which treats each tweet as a document. Here is a small example (note: the example below is a quanteda document-feature matrix, but the idea is the same):
example <- c("I ran to the store", "I walked to the store", "I ran to the park")
example_dfm <- dfm(example, remove=stopwords("english"), verbose=TRUE)
## Creating a dfm from a character input...
## ... lowercasing
## ... found 3 documents, 7 features
## ... removed 3 features
## ... created a 3 x 4 sparse dfm
## ... complete.
## Elapsed time: 0.5 seconds.
example_dfm
## Document-feature matrix of: 3 documents, 4 features (50.0% sparse).
## 3 x 4 sparse Matrix of class "dfm"
## features
## docs ran store walked park
## text1 1 1 0 0
## text2 0 1 1 0
## text3 1 0 0 1
As you can see above, rows correspond to the texts in the collection and columns correspond to the terms that appear across those texts. Each cell records how many times a term appears in a document (here every count is 0 or 1 because no word repeats within these short examples).
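To confirm that the cells are counts rather than simple presence indicators, here is a quick sketch with a repeated word (output omitted):
example2 <- "I ran to the store and then ran home"
dfm(example2, remove = stopwords("english")) # "ran" appears twice, so its cell is 2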
Now, back to our larger document-term matrix. The corpus contains 28,053 “documents” (i.e., tweets). I removed punctuation (e.g., _ or .), stopwords, and terms that appear in fewer than 10 tweets (i.e., 46,455 terms); the rare-term step is not echoed in the chunk below, but a sketch follows it.
corpus <- tm::Corpus(VectorSource(text_token))
corpus <- corpus %>% tm_map(removePunctuation) %>% tm_map(removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
text_DTM <- DocumentTermMatrix(corpus)
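A minimal sketch of one way to apply the rare-term filter when constructing the matrix, using tm’s bounds control on document frequency, would be:
# Discard terms that appear in fewer than 10 tweets (document-frequency bound)
text_DTM <- DocumentTermMatrix(corpus,
                               control = list(bounds = list(global = c(10, Inf))))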
Some of the tweets are blank (after cleaning the text). Consequently, these must be removed for topic modeling.
## remove blank tweets
rowTotals <- apply(text_DTM, 1, sum)  # find the total number of words in each document
text_DTM <- text_DTM[rowTotals > 0, ] # keep only documents with at least one word
To determine the number of topics (k), I performed 5-fold cross-validation with candidate k values of 2, 3, 4, 5, 10, 20, 30, 40, and 50. Five-fold cross-validation fits five models for each candidate k, each time holding out a different fold for evaluation.
folds <- 5
split <- sample(1:folds,nrow(text_DTM), replace = TRUE)
topics <- c(2:5, 10, 20, 30, 40, 50) # candidates for how many topics
results <- foreach(j = 1:length(topics), .combine = rbind) %dopar% {
  k <- topics[j]
  results <- matrix(0, nrow = folds, ncol = 2)
  colnames(results) <- c("k", "perplexity")
  for (i in 1:folds){
    # hold out fold i, train on the remaining folds
    train_set <- text_DTM[split != i, ]
    test_set  <- text_DTM[split == i, ]
    fitted <- LDA(train_set, k = k, method = "Gibbs",
                  control = list(seed = 123, burnin = 100, iter = 500))
    # evaluate the fitted model on the held-out fold
    results[i, ] <- c(k, perplexity(fitted, newdata = test_set))
  }
  return(results)
}
## Warning: executing %dopar% sequentially: no parallel backend registered
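The warning shows that the loop ran sequentially because no parallel backend was registered. A minimal sketch of registering one with doParallel (and passing .packages so each worker can find LDA() and perplexity()) would be:
library(doParallel)
cl <- makeCluster(parallel::detectCores() - 1) # leave one core free
registerDoParallel(cl)
# re-run the foreach() loop above, adding .packages = "topicmodels"
# to the foreach() call so the workers load LDA() and perplexity()
stopCluster(cl)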
Lastly, I use perplexity to determine the best possible k. Perplexity measures how well a model fit on the training folds predicts the held-out fold (i.e., the validation or test set). Low perplexity means the model is less “perplexed” by the held-out set (i.e., it makes better predictions).
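As a rough reference (see the topicmodels documentation for the exact definition), the perplexity reported here is computed on the held-out fold as exp(-log-likelihood of the held-out documents / number of held-out tokens), so lower values indicate a better predictive fit.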
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
geom_point() +
geom_smooth(se = F) +
ggtitle("5-fold cross-validation for 'Immigrant Tweets'") +
labs(x = "k", y = "Perplexity")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
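To pick a single k from the curve, one option (a sketch; in practice the choice also weighs topic interpretability) is to average perplexity across the five folds and take the k with the lowest mean:
cv_summary <- aggregate(perplexity ~ k, data = results_df, FUN = mean) # mean perplexity per k
cv_summary[which.min(cv_summary$perplexity), ]                         # k with the lowest mean perplexity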