Tokenization + Preprocessing

Analysis of this data was done using the quanteda package in R. After manually selecting articles published in or after the year 2000, I began pre-processing the text. First, I tokenized the text by word, which breaks it into smaller, more interpretable units. Next, I applied a series of pre-processing steps that reduce noise in the data and improve computational efficiency during the analysis.
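
For reference, this selection could also be done programmatically if the publication year were encoded in each filename. The sketch below is only illustrative: the folder name, the filename pattern (e.g. article_2003_001.txt), and the docvar names are hypothetical, not the ones used in this project.

library(readtext)
#hypothetical example: read a year docvar out of filenames such as article_2003_001.txt
all_articles <- readtext("articles_raw/*.txt",
                         docvarsfrom = "filenames",
                         docvarnames = c("prefix", "year", "id"))
#keep only articles published in or after 2000
all_articles <- subset(all_articles, as.integer(year) >= 2000)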

Step 1: Remove punctuation, numbers, and URLs, then convert all tokens to lowercase and remove stopwords that are common in the English language. This reduces noise in the data and sharpens the focus on the keywords that will help identify topics.
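
To illustrate the effect of this step, here is a minimal standalone sketch on a made-up sentence; the sentence and the object name toy are purely illustrative.

library(quanteda)
#tokenize a toy sentence, dropping punctuation, numbers and urls
toy <- tokens("In 2019, the UN reported 450 cases: see https://example.org",
              remove_punct = TRUE, remove_numbers = TRUE, remove_url = TRUE)
#lowercase the tokens and remove english stopwords
toy <- tokens_remove(tokens_tolower(toy), stopwords("english"))
#only content-bearing tokens such as "un", "reported" and "cases" remain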

Step 2: I perform stemming, which reduces most words to their root form, for example, torture, tortured, and tortures all reduce to the stem "tortur". I chose this technique because we are not working with context at this stage and our goal is to identify the topics prevalent in the data. Leaving inflected variants of the same word as separate features would dilute the signal and make it harder to find topics efficiently.
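
As a quick standalone check of what the stemmer does to the example words above:

library(quanteda)
#stem the example words with quanteda's default (Snowball) english stemmer
tokens_wordstem(tokens("torture tortured tortures"))
#all three variants collapse to the single stem "tortur"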

Step 3: I create a compound token for "human rights" (the code below also compounds "U.S." and "domestic violence"), which tells the analysis to treat the phrase as a single token rather than two separate words.
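
A minimal sketch of how compounding behaves, using a toy sentence that is illustrative only:

library(quanteda)
toy <- tokens("the council debated human rights abuses")
#phrase() marks the whitespace-separated pattern as a sequence of tokens
tokens_compound(toy, pattern = phrase("human right*"))
#"human" and "rights" are joined into the single token "human_rights"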

I then create a document-feature matrix (dfm) and apply relative pruning: I keep only features that occur at least 75 times in the corpus and that appear in between 15% and 90% of the documents.

library(readtext)
#quanteda provides the tokenization and dfm functions; quanteda.textplots provides the wordcloud
library(quanteda)
library(quanteda.textplots)
##use readtext to load in the folder of articles
articles <- readtext("/Users/isha/Desktop/GitHub/HumanRightsTextNLP/articles_final/*.txt")
#select text column from articles
tokens <- articles$text %>% 
  #tokenize to words
  tokens(what = "word",
         #remove punctuation
         remove_punct = TRUE,
         #remove numbers
         remove_numbers = TRUE,
         #remove urls
         remove_url = TRUE) %>% 
  #change all tokens to lowercase
  tokens_tolower() %>% 
  #remove common stop words from the english language
  tokens_remove(stopwords("english")) %>% 
  #stem using quanteda's language stemmer (lemmatization is a possible alternative here)
  tokens_wordstem(language = quanteda_options("language_stemmer")) %>% 
  #compound tokens so that "human rights", "u.s." and "domestic violence" stay together;
  #phrase() treats each whitespace-separated pattern as a sequence of tokens
  #(a compound for "un" could be added here as well)
  tokens_compound(pattern = phrase(c("human right*", "u.s.*", "domestic violence*")))

#applying relative pruning: create a document-feature matrix that keeps only features occurring at least 75 times overall and appearing in between 15% and 90% of the documents
dfm <- dfm_trim(dfm(tokens),
                min_docfreq = 0.15, max_docfreq = 0.90, docfreq_type = "prop",
                min_termfreq = 75, verbose = TRUE)
## Removing features occurring:
##   - fewer than 75 times: 218,126
##   - in fewer than 859.2 documents: 225,824
##   - in more than 5155.2 documents: 19
##   Total features removed: 225,843 (99.3%).
#remove leftover angle-bracket symbols and uninformative transition words
dfm <- dfm_remove(dfm,c("<",">", "however", "although"))

I create a word cloud of the 50 most frequent features in the cleaned document-feature matrix.

textplot_wordcloud(dfm, max_words = 50, random_order = TRUE, color = "#0086b3")
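
For a quick numeric check of the same information, the most frequent features can also be listed directly with quanteda's topfeatures():

#list the 50 most frequent features and their counts
topfeatures(dfm, n = 50)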