Immigration Keyword Dictionary Creation

This script outlines the steps taken to create our immigration keyword dictionary. The dictionary will be used to collect tweets discussing immigration in the three months leading up to the 2016 Presidential Election (i.e., August 8th to November 8th).

1. First, I collected tweets with the keywords immigration, immigrant, and immigrants. Next, I randomly sampled 10,000 tweets for each month of interest.
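
The per-month random sample can be sketched in base R as follows. This is a minimal illustration, not the code that was actually run: the `tweets` data frame, its `month` column, and the sample size of 5 are hypothetical stand-ins for the collected data and the 10,000 tweets per month.

```r
# Hypothetical stand-in for the collected tweets (60 rows, 3 months).
set.seed(42)
tweets <- data.frame(
  Contents = paste("tweet", 1:60),
  month = rep(c("Aug", "Sep", "Oct"), each = 20),
  stringsAsFactors = FALSE
)
n_per_month <- 5  # the write-up sampled 10,000 per month

# Group row indices by month, then draw n_per_month rows from each group.
idx_by_month <- split(seq_len(nrow(tweets)), tweets$month)
sampled_idx <- unlist(lapply(idx_by_month, function(idx) {
  idx[sample.int(length(idx), n_per_month)]
}))
immigration_sample <- tweets[sampled_idx, ]

table(immigration_sample$month)  # number of sampled tweets per month
```

Sampling within each month, rather than from the pooled data, keeps a high-volume month from dominating the sample.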

text_token <- immigration_tweets$Contents[!duplicated(immigration_tweets$Contents)] #remove duplicates
text_token[1:5] #sample of the first 5 tweets
## [1] "RT @shezumi lazy immigrants, going to a country that's not theirs and taking things for free. https://t.co/P0ZUZvJJTt"                                
## [2] "PayPal cofounder Max Levchin praises White House proposal to attract immigrant entrepreneurs https://t.co/1pClubDsrk"                                 
## [3] "RT @JaredWyand #ManyPeopleAreSaying Muslim immigration is a plot to make us choose between freedom & safety ending in control https://t.co/OqSMCo0U0W"
## [4] "Melania #Trump will address immigration controversy in next couple of weeks, D Trump says. 14 days & counting #GOP https://t.co/xEIk0xQl7N"           
## [5] "RT @WeNeedTrump RETWEET if you want American citizens to come before refugees and illegal immigrants. https://t.co/LX87WxAbMY"

The tweets are now in R, but the text includes many unnecessary elements. To clean the text, I removed URLs, the RT notation, numbers, emojis, and extra white space. I also made all of the words lowercase. All hashtags and @mentions remain.

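text_token <- text_token %>%
  stringr::str_replace_all("\\bRT\\b *", "") %>%  # drop the RT marker only when it is a standalone token
  stringr::str_replace_all("https://t\\.co/[A-Za-z0-9]*", "") %>%  # drop t.co links
  stringr::str_replace_all("[[:digit:]]", "") %>%  # drop numbers
  iconv("latin1", "ASCII", sub = "") %>%  # drop non-ASCII characters such as emojis
  stringr::str_squish() %>%  # collapse repeated white space
  tolower()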
text_token[1:5]
## [1] "@shezumi lazy immigrants, going to a country that's not theirs and taking things for free."                                
## [2] "paypal cofounder max levchin praises white house proposal to attract immigrant entrepreneurs"                              
## [3] "@jaredwyand #manypeoplearesaying muslim immigration is a plot to make us choose between freedom & safety ending in control"
## [4] "melania #trump will address immigration controversy in next couple of weeks, d trump says. days & counting #gop"           
## [5] "@weneedtrump retweet if you want american citizens to come before refugees and illegal immigrants."
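
One detail worth checking in a cleaning step like this is how a retweet pattern treats capital letters inside ordinary words. The base R sketch below uses made-up strings (not taken from the collected tweets) and compares an unanchored pattern with a word-boundary version. It is an illustration only, not part of the original pipeline.

```r
# Made-up strings: one retweet and one ordinary tweet with "RT" inside a word.
x <- c("RT @user SMART borders", "START here")

# Unanchored: "RT" followed by any number of spaces, anywhere in the string.
loose <- gsub("RT *", "", x)

# Anchored: "RT" only when it stands alone as a token.
strict <- gsub("\\bRT\\b *", "", x, perl = TRUE)

loose   # "@user SMAborders" "STAhere"
strict  # "@user SMART borders" "START here"
```

Because the pattern runs before `tolower()`, only upper-case "RT" is affected, so the issue is limited to words written in capitals.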

Now that the text is clean, I examined frequent words using quanteda.

token <- tokens_remove(tokens(text_token, remove_punct = T, remove_twitter = F), stopwords("english")) ### tokenize text (removing punctuation & stopwords)

Here are the tokenized words.

token[1:5]
## tokens from 5 documents.
## text1 :
## [1] "@shezumi"   "lazy"       "immigrants" "going"      "country"   
## [6] "taking"     "things"     "free"      
## 
## text2 :
##  [1] "paypal"        "cofounder"     "max"           "levchin"      
##  [5] "praises"       "white"         "house"         "proposal"     
##  [9] "attract"       "immigrant"     "entrepreneurs"
## 
## text3 :
##  [1] "@jaredwyand"          "#manypeoplearesaying" "muslim"              
##  [4] "immigration"          "plot"                 "make"                
##  [7] "us"                   "choose"               "freedom"             
## [10] "safety"               "ending"               "control"             
## 
## text4 :
##  [1] "melania"     "#trump"      "will"        "address"     "immigration"
##  [6] "controversy" "next"        "couple"      "weeks"       "d"          
## [11] "trump"       "says"        "days"        "counting"    "#gop"       
## 
## text5 :
## [1] "@weneedtrump" "retweet"      "want"         "american"    
## [5] "citizens"     "come"         "refugees"     "illegal"     
## [9] "immigrants"
n_grams <- tokens_ngrams(token, n = 1:2, concatenator = " ") # build unigrams and bigrams
ngram_dfm <- dfm(n_grams) # document-feature matrix
Grams_imm <- topfeatures(ngram_dfm, 300) # top 300 unigrams and bigrams
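
To make the n-gram step concrete, here is a base R sketch of how bigrams can be formed from the tokens of the first tweet shown above. It is a simplified stand-in for `tokens_ngrams()`, not its implementation, and the variable names are hypothetical.

```r
# Tokens of text1 after punctuation and stopword removal (copied from the output above).
tokens_text1 <- c("@shezumi", "lazy", "immigrants", "going",
                  "country", "taking", "things", "free")

# Pair each token with the one that follows it.
bigrams <- paste(head(tokens_text1, -1), tail(tokens_text1, -1))

# Unigrams and bigrams together, as with n = 1:2.
uni_and_bi <- c(tokens_text1, bigrams)

bigrams
```

Because stopwords are removed before the n-grams are built, a bigram such as "going country" joins words that were not adjacent in the original tweet ("going to a country").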

Here are the top 300 unigrams and bigrams. I used the immigration-related terms among them to expand the keyword dictionary for another search. Added terms included border, illegal immigrant, immigration policy, and immigration reform.

Grams_imm
##             immigration              immigrants                   trump 
##                    9959                    7260                    4171 
##                 illegal               immigrant      illegal immigrants 
##                    3223                    2811                    1510 
##                     via            #immigration                    will 
##                    1393                    1162                    1149 
##                  donald                      us                 trump's 
##                    1083                    1062                     990 
##     illegal immigration        @realdonaldtrump                     new 
##                     899                     863                     861 
##                  muslim                  people            donald trump 
##                     819                     684                     670 
##                 hillary                  policy            undocumented 
##                     653                     648                     646 
##                 clinton                     u.s                    says 
##                     635                     635                     635 
##                 america                    like                    plan 
##                     618                     609                     607 
##                    just                    vote                     now 
##                     605                     542                     540 
##                  speech                     get       illegal immigrant 
##                     539                     513                     505 
##               americans                american                  #trump 
##                     499                     499                     497 
## undocumented immigrants                   legal                     can 
##                     494                     491                     477 
##                   obama      immigration policy                  border 
##                     466                     446                     443 
##                 country                  trumps                    want 
##                     436                     428                     424 
##                    jobs                  reform                     one 
##                     422                     418                     416 
##                    stop                     law        immigration plan 
##                     404                     388                     376 
##                    laws                   wants                   taxes 
##                     374                     364                     360 
##                     say     trump's immigration                    wall 
##                     352                     351                     346 
##                    need      immigration speech      immigration reform 
##                     341                     334                     315 
##                    said                campaign                 million 
##                     312                     311                     310 
##                   white                   going                refugees 
##                     310                     310                     309 
##                policies                     pay                       w 
##                     303                     300                     299 
##                    back                   #maga                    many 
##                     299                     299                     297 
##                    know                election                 melania 
##                     296                     296                     293 
##                   right                    make                   women 
##                     290                     288                     287 
##                 support         @hillaryclinton             citizenship 
##                     286                     279                     273 
##                    news                   first                   #tcot 
##                     272                     272                     269 
##                citizens                   video      muslim immigration 
##                     269                     266                     260 
##                security                    time                    good 
##                     256                     256                     253 
##                 muslims             #immigrants                 vetting 
##                     252                     251                     248 
##                  voters                @foxnews                   court 
##                     247                     247                     245 
##                    mass        immigration laws                   issue 
##                     244                     241                     240 
##                    work                 mexican                  mexico 
##                     240                     238                     237 
##                   think                 economy                    help 
##                     234                     234                     233 
##                    must                children       trump immigration 
##                     231                     229                     229 
##                   great                    year                      go 
##                     226                     225                     225 
##                   still                   every                  deport 
##                     224                     222                     222 
##          anti-immigrant                   crime                 borders 
##                     221                     219                     218 
##                    come                   black       muslim immigrants 
##                     217                     216                     215 
##                  racist                    open                  status 
##                     213                     210                     208 
##                  report                   never                    even 
##                     208                     207                     207 
##                     ban                   years                   watch 
##                     206                     204                     203 
##          donald trump's               president                  nation 
##                     200                     196                     194 
##                  better                    keep                    care 
##                     192                     192                     191 
##                    take                  debate                     see 
##                     191                     190                     190 
##                   media                families                     usa 
##                     183                     183                     182 
##                   state                 federal    immigration policies 
##                     182                     182                     181 
##                  stance                   story             deportation 
##                     181                     181                     180 
##                     way                   world                  system 
##                     180                     177                     176 
##                  killed                       u                 extreme 
##                     173                     172                     171 
##              trump says                     end                 workers 
##                     171                     170                     170 
##                    test        @fairimmigration                national 
##                     169                     168                     168 
##                criminal               terrorism                  really 
##                     168                     168                     167 
##                 supreme                 problem                    also 
##                     167                     167                     165 
##                     tax                     day                     bad 
##                     165                     163                     162 
##                    much                    talk                     may 
##                     162                     161                     160 
##                    give                    read                 refugee 
##                     160                     160                     160 
##                     let                   times                   today 
##                     160                     158                     158 
##         hillary clinton                     gop           supreme court 
##                     157                     157                     156 
##                  issues                illegals                    call 
##                     155                     155                     155 
##            #debatenight              supporters                deported 
##                     155                     154                     154 
##                     big              government                    bill 
##                     153                     153                     150 
##               breitbart                 history                   party 
##                     149                     148                     147 
##               detention                     two                  family 
##                     146                     146                     146 
##                  voting                   shows                    poll 
##                     146                     146                     145 
##                    tell                position                 #debate 
##                     144                     144                     143 
##                  coming         immigration law                @youtube 
##                     141                     141                     141 
##                  change         extreme vetting               illegally 
##                     141                     140                     140 
##                    race                hispanic                   calls 
##                     138                     138                     138 
##                    home                     man                   build 
##                     138                     137                     137 
##           melania trump                    made                      uk 
##                     137                     136                     136 
##                       s                   trade               countries 
##                     136                     136                     135 
##                 another                   kaine                 working 
##                     135                     135                     134 
##                    look                   since      trumps immigration 
##                     134                     132                     132 
##                   soros                    hate         immigration via 
##                     132                     132                     130 
##                  rights                    live                   pence 
##                     130                     129                     128 
##                       r                  please                    fact 
##                     128                     127                     127 
##                    real                    left              washington 
##                     126                     126                     126 
##                   needs                   plans                    love 
##                     126                     126                     126 
##               softening                    ever                 islamic 
##                     125                     125                     125 
##                 nothing                   study                     got 
##                     124                     124                     124 
##            presidential                     dhs                     cnn 
##                     124                     123                     123 
##                    last                 talking        mass immigration 
##                     123                     123                     122 
##                     due                   woman                   #news 
##                     122                     121                     120 
##                  saying                 amnesty                    life 
##                     120                     120                     120 
##                    case             #trumppence                     yes 
##                     119                     119                     119 
##                    post                    free                    show 
##                     118                     117                     117 
##               candidate             enforcement                 private 
##                     117                     117                     116 
##                   bring               democrats                   lives 
##                     116                     116                     116 
##                  trying                    isis                    stay 
##                     116                     116                     115 
##                 billion                    used               questions 
##                     115                     115                     115 
##                    next                     job                  states 
##                     115                     115                     114 
##                 arizona      immigration status                  europe 
##                     113                     113                     113 
##            open borders                 latinos                     top 
##                     112                     112                     111 
##                 obama's                #hillary                 believe 
##                     111                     110                     110 
##                 foreign                  become                     fox 
##                     110                     109                     109 
##                millions                 control                    york 
##                     109                     109                     109 
##                   fraud                       b          killed illegal 
##                     108                     108                     107 
##                   thing           donald trumps                   wrong 
##                     107                     107                     106
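
As a rough illustration of how an expanded dictionary could be matched against tweet text, the sketch below builds one regular expression from the seed and added terms. The actual collection tool and query syntax are not described here, so the pattern-matching approach, the variable names, and the third example string are assumptions.

```r
seed_terms  <- c("immigration", "immigrant", "immigrants")
added_terms <- c("border", "illegal immigrant", "immigration policy",
                 "immigration reform")
dictionary  <- unique(c(seed_terms, added_terms))

# One alternation pattern; \\b keeps "border" from matching inside longer words.
pattern <- paste0("\\b(", paste(dictionary, collapse = "|"), ")\\b")

example_tweets <- c(
  "explosions rock syrian border as turkey presses operation",
  "@irritatedwoman hillary clinton drafting illegal immigrants to boost voter turnout",
  "great game last night"  # made-up off-topic string
)
matches <- grepl(pattern, example_tweets, ignore.case = TRUE, perl = TRUE)
matches  # TRUE TRUE FALSE
```

The first example shows the trade-off of a broad term: "border" matches a tweet about Syria and Turkey, so a wider dictionary can also bring in tweets that are not about immigration.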

2. With an expanded list of keywords, I collected another sample of 30,000 tweets (10,000/month).

Again, I cleaned the text, removing unwanted items (e.g., URLs, emojis, numbers).

text_token <- immigration_tweets2$Contents[!duplicated(immigration_tweets2$Contents)] #remove duplicates
text_token[1:5] #sample of the first 5 tweets
## [1] "Explosions rock Syrian border as Turkey presses operation https://t.co/Jh5kZWV0Hx"                                            
## [2] "RT @irritatedwoman Hillary Clinton Drafting Illegal Immigrants To Boost Voter Turnout https://t.co/Hun5WxUgn1"                
## [3] "learning some of my flaws as a competitor brings a sense of security"                                                         
## [4] "@JoeNBC Goldman Sachs the in card 4 Wall Street Evan McMullin think American_"                                                
## [5] "RT @SportsCenter Simone Biles and Aly Raisman take gold and silver in Individual All-Around for #USA. https://t.co/vCHisn2KhM"
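text_token <- text_token %>%
  stringr::str_replace_all("\\bRT\\b *", "") %>%  # drop the RT marker only when it is a standalone token
  stringr::str_replace_all("https://t\\.co/[A-Za-z0-9]*", "") %>%  # drop t.co links
  stringr::str_replace_all("[[:digit:]]", "") %>%  # drop numbers
  iconv("latin1", "ASCII", sub = "") %>%  # drop non-ASCII characters such as emojis
  stringr::str_squish() %>%  # collapse repeated white space
  tolower()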
text_token[1:5] #sample of the first 5 tweets
## [1] "explosions rock syrian border as turkey presses operation"                                         
## [2] "@irritatedwoman hillary clinton drafting illegal immigrants to boost voter turnout"                
## [3] "learning some of my flaws as a competitor brings a sense of security"                              
## [4] "@joenbc goldman sachs the in card wall street evan mcmullin think american_"                       
## [5] "@sportscenter simone biles and aly raisman take gold and silver in individual all-around for #usa."

Next, I converted my vector of tweets into a corpus and created a document term matrix. The document term matrix treats each tweet as a document. Here is an example (note: the example is a quanteda document-feature matrix, but the idea is the same):

example <- c("I ran to the store", "I walked to the store", "I ran to the park")

example_dfm <- dfm(example, remove=stopwords("english"), verbose=TRUE)
## Creating a dfm from a character input...
##    ... lowercasing
##    ... found 3 documents, 7 features
##    ... removed 3 features
##    ... created a 3 x 4 sparse dfm
##    ... complete. 
## Elapsed time: 0.5 seconds.
example_dfm 
## Document-feature matrix of: 3 documents, 4 features (50.0% sparse).
## 3 x 4 sparse Matrix of class "dfm"
##        features
## docs    ran store walked park
##   text1   1     1      0    0
##   text2   0     1      1    0
##   text3   1     0      0    1

As you can see above, each row corresponds to a text in the collection and each column corresponds to a term that appears somewhere in the collection. Each cell holds the number of times the term occurs in that document, so here a 1 marks a word that is present and a 0 a word that is absent.
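
The same kind of matrix can be built by hand in base R, which makes the row and column logic explicit. This is only a sketch: it keeps stopwords, so it has all 7 features rather than the 4 shown above, and the variable names are hypothetical.

```r
docs <- c("I ran to the store", "I walked to the store", "I ran to the park")

# Lowercase and split on spaces to get one token vector per document.
tokens_list <- strsplit(tolower(docs), " ", fixed = TRUE)
vocab <- unique(unlist(tokens_list))

# Count each vocabulary term in each document (documents in rows).
dtm <- t(vapply(
  tokens_list,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(dtm) <- list(paste0("text", seq_along(docs)), vocab)

dtm
```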

Now, back to our larger document term matrix. The corpus contains 28,053 “documents” (i.e., tweets). I removed punctuation (e.g., _ or .), stopwords, and terms that appear in fewer than 10 tweets (i.e., 46,455 terms).

corpus <- tm::Corpus(VectorSource(text_token))

corpus <- corpus %>% tm_map(removePunctuation) %>% tm_map(removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
text_DTM <- DocumentTermMatrix(corpus)
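
The code above does not show the rare-term filter mentioned earlier. One way such a filter might look is sketched below in base R on a small made-up count matrix. The matrix, the term names, and the threshold of 2 are assumptions for illustration; the write-up used a threshold of 10 tweets.

```r
# Made-up counts: 3 documents (rows) by 4 terms (columns).
toy_dtm <- matrix(
  c(1, 0, 2,   # border
    1, 1, 0,   # wall
    0, 0, 1,   # visa
    1, 1, 1),  # trump
  nrow = 3,
  dimnames = list(paste0("text", 1:3), c("border", "wall", "visa", "trump"))
)

# Document frequency: in how many documents does each term appear?
doc_freq <- colSums(toy_dtm > 0)

min_docs <- 2  # illustrative threshold
filtered <- toy_dtm[, doc_freq >= min_docs, drop = FALSE]

colnames(filtered)  # "border" "wall" "trump"
```

Counting `toy_dtm > 0` rather than summing the counts matters: "border" occurs three times in total but in only two documents.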

Some of the tweets are blank after cleaning. LDA requires every document to contain at least one term, so these rows must be removed before topic modeling.

# remove blank tweets
rowTotals <- slam::row_sums(text_DTM) # number of terms remaining in each document
text_DTM <- text_DTM[rowTotals > 0, ]

To determine the number of topics (k), I performed 5-fold cross-validation, with candidate values of k set to 2, 3, 4, 5, 10, 20, 30, 40, and 50. For each candidate k, 5-fold cross-validation fits five models, each trained on four folds and evaluated on the remaining fold.

# uses foreach (foreach, %dopar%) and topicmodels (LDA, perplexity)
folds <- 5
fold_id <- sample(1:folds, nrow(text_DTM), replace = TRUE)  # assign each tweet to a fold
topics <- c(2:5, 10, 20, 30, 40, 50)  # candidates for how many topics

results <- foreach(j = seq_along(topics), .combine = rbind) %dopar% {
  k <- topics[j]
  fold_results <- matrix(0, nrow = folds, ncol = 2)
  colnames(fold_results) <- c("k", "perplexity")
  for (i in 1:folds) {
    train_set <- text_DTM[fold_id != i, ]  # four folds for fitting
    test_set  <- text_DTM[fold_id == i, ]  # one fold held out
    fitted <- LDA(train_set, k = k, method = "Gibbs",
                  control = list(seed = 123, burnin = 100, iter = 500))
    fold_results[i, ] <- c(k, perplexity(fitted, newdata = test_set))
  }
  fold_results  # one row per fold for this k
}
## Warning: executing %dopar% sequentially: no parallel backend registered
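
Drawing fold labels with `replace = TRUE`, as above, gives folds of roughly but not exactly equal size. The base R sketch below compares that with a balanced assignment. The document count of 103 and the variable names are made up for illustration.

```r
set.seed(1)
n_docs <- 103  # hypothetical number of documents
folds  <- 5

# Independent draws: fold sizes vary by chance.
random_folds <- sample(1:folds, n_docs, replace = TRUE)

# Balanced alternative: repeat 1..folds to length n_docs, then shuffle.
balanced_folds <- sample(rep(1:folds, length.out = n_docs))

table(random_folds)
table(balanced_folds)

# Every document lands in exactly one test fold.
test_sizes <- vapply(1:folds, function(i) sum(balanced_folds == i), integer(1))
sum(test_sizes) == n_docs
```

With tens of thousands of tweets the difference in fold size is small, so either assignment is reasonable.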

Lastly, I used perplexity to determine the best possible k. Perplexity measures how well a model fitted on the training set predicts the held-out set (i.e., validation or testing set). Lower perplexity means that the model is less “perplexed” by the held-out set (i.e., better at making predictions).
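
To give a feel for the quantity, the sketch below computes perplexity for a toy unigram model, using the usual definition: the exponential of the negative average log-likelihood per held-out word. This is not an LDA model and the word lists are made up; it only illustrates the definition.

```r
# Made-up training and held-out words.
train <- c("border", "wall", "border", "visa")
test  <- c("border", "visa", "wall", "border")

# Unigram probabilities estimated from the training words.
probs <- table(train) / length(train)  # border 0.50, visa 0.25, wall 0.25

# Log-likelihood of the held-out words under the fitted probabilities.
log_lik <- sum(log(as.numeric(probs[test])))

# Perplexity: exp of the negative average log-likelihood per word.
perp <- exp(-log_lik / length(test))
perp  # about 2.83
```

A model that gave each of the three words probability 1/3 would have perplexity 3, so the fitted model is slightly less "perplexed" by the held-out words than a uniform guess.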

results_df <- as.data.frame(results)

ggplot(results_df, aes(x = k, y = perplexity)) +
   geom_point() +
   geom_smooth(se = F) +
   ggtitle("5-fold cross-validation for 'Immigrant Tweets'") +
   labs(x = "k", y = "Perplexity")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
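
Alongside the plot, the fold-level results can be averaged per k to compare candidates numerically. The sketch below uses a made-up results table with the same two columns; the perplexity values are invented for illustration and are not results from this analysis.

```r
# Made-up results: two folds for each of three candidate k values.
results_demo <- data.frame(
  k = rep(c(2, 5, 10), each = 2),
  perplexity = c(900, 920, 700, 720, 750, 770)
)

# Mean held-out perplexity for each candidate k.
mean_perp <- aggregate(perplexity ~ k, data = results_demo, FUN = mean)

# Candidate with the lowest mean perplexity.
best_k <- mean_perp$k[which.min(mean_perp$perplexity)]
best_k  # 5
```

The lowest mean is a starting point rather than a rule: when several values of k have similar perplexity, the smaller and more interpretable model is often preferred.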