Executive Summary

The Capstone Project of the Johns Hopkins University Data Science Specialization consists of building a text prediction algorithm and deploying it as a Shiny application data product.

The project starts by analyzing a large corpus of text documents to discover the structure of the data and how words are put together. It then proceeds to clean and analyze the text data, and finally to build a predictive text model.

This Milestone Report explains the first tasks of the Capstone Project:

Obtaining the data

The data comes from a corpus that is located in the following website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data is from a corpus called HC Corpora. There are four different languages in the corpus (English, German, Finnish and Russian); however, the project specifies that we are to use only the English corpus. So we download the entire archive but only work with the English files (which have been language filtered but may still contain some foreign text).
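For reproducibility, the download and extraction step can be scripted along the following lines (a sketch; the destination folder is an illustrative choice, and the final/en_US/ paths reflect the layout of the Coursera-SwiftKey archive):

### Download the Coursera-SwiftKey archive and extract only the English files.
### The destination folder below is illustrative -- adapt it to your own setup.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
    download.file(zip_url, destfile = zip_file, mode = "wb")
}

### The English files sit under final/en_US/ inside the archive
unzip(zip_file,
      files = c("final/en_US/en_US.twitter.txt",
                "final/en_US/en_US.blogs.txt",
                "final/en_US/en_US.news.txt"),
      exdir = "Raw datasets", junkpaths = TRUE)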

We then load the English files into R and obtain basic information about the data:

library(readr)   # read_lines()

con_Twitter <- file("C:\\Users\\Thruinin II\\Documents\\R_files\\Course 10 - Capstone\\Raw datasets\\en_US.twitter.txt", "r")
con_Blogs <- file("C:\\Users\\Thruinin II\\Documents\\R_files\\Course 10 - Capstone\\Raw datasets\\en_US.blogs.txt", "r")
### the news file is opened in binary mode ("rb") so reading is not cut short by an embedded control character
con_News <- file("C:\\Users\\Thruinin II\\Documents\\R_files\\Course 10 - Capstone\\Raw datasets\\en_US.news.txt", "rb")

Twitter_full <- readLines(con_Twitter, skipNul = TRUE)
Blogs_full <- readLines(con_Blogs, skipNul = TRUE)
News_full <- read_lines(con_News, skip_empty_rows = TRUE)

close(con_Twitter); close(con_Blogs); close(con_News)

So there is one text file with Twitter tweets, another with blog entries, and a third with news feeds.

The file sizes are:

## [1] "Twitter tweets file = 334.5 Mb"
## [1] "Blogs file = 267.8 Mb"
## [1] "News file = 269.8 Mb"

Each file contains the following # of lines and words:

File        Lines        Words
Twitter     2,360,148    30,373,583
Blogs         899,288    37,334,131
News        1,010,242    34,372,530
## [1] "Total # of lines in corpus = 4,269,678"

The files are quite big, so for computational purposes we will use a random sample from the three files. We will use approximately 5.8% of the total lines in the text corpus (250,000 of its 4,269,678 lines):

set.seed(58)
sampleTwitter <- sample(Twitter_full, 80000, replace = FALSE)

set.seed(9)
sampleBlogs <- sample(Blogs_full, 90000, replace = FALSE)

set.seed(91)
sampleNews <- sample(News_full, 80000, replace = FALSE)

Once we have our three samples, we build the text corpus we will work with to analyse the text data.
I chose the QUANTEDA R package for its simplicity, and because it uses less memory and thus runs faster. It also works well with data tables (which are faster to build, process and search than data frames).

My app will be called “Typing Caddie” –therefore the corpus I’ll work with (to build the text prediction algorithm) is called “Caddie corpus”.

### Using the QUANTEDA package we build the corpus  ###
library(quanteda)
caddie_corpus <- corpus(c(sampleNews, sampleTwitter, sampleBlogs))

### Using a corpus of 250k texts  ##################
### A bigger corpus (>250k texts) does not reduce sparsity or change the top n-gram frequencies

rm(sampleBlogs)
rm(sampleTwitter)
rm(sampleNews)

The project suggests removing profane words (and in the Blogs file in particular there are plenty), so we remove them from our corpus:

### Read the list of profane words
### The list was obtained from <https://code.google.com/archive/p/badwordslist/downloads>
### It contains ~1300 words. After review, the list was trimmed down to ~1000 words to speed up
### the removal step. The processed corpus is unaffected by the trimming (a similar corpus is obtained).

library(stringr)   # str_c(), str_remove_all()

profanity <- read_lines("C:\\Users\\Thruinin II\\Documents\\R_files\\Course 10 - Capstone\\bad-words.txt")

profanity <- str_c(profanity, collapse = "|")   # collapse the word list into a single regex of alternatives

caddie_corpus <- str_remove_all(caddie_corpus, profanity)
### Completed removing profanity

Next steps are:

  • removing symbols and foreign words that appear frequently such as â, ¥, ð, Â, ã

  • removing “noise” from the corpus (punctuation, symbols, separators, URL addresses & strange text strings that do not constitute words)

To perform the second of these tasks we “tokenize” the corpus using the tokens() function of the QUANTEDA package.

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: “Friends, Romans, Countrymen, lend me your ears;”

Output: “Friends”, “Romans”, “Countrymen”, “lend”, “me”, “your”, “ears”
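In quanteda the same example can be reproduced directly (a small illustration):

### Tokenizing the example sentence with quanteda
### (returns the seven tokens listed in the Output line above)
tokens("Friends, Romans, Countrymen, lend me your ears;", remove_punct = TRUE)

Applying the same idea to our corpus, after first stripping the stray characters mentioned above: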

### Remove frequently occurring stray symbols / mis-encoded characters
caddie_corpus <- str_remove_all(caddie_corpus, "â|¥|ð|~|Â|¦|ã")

caddie_tokens <- tokens(caddie_corpus, remove_symbols = TRUE, remove_numbers = TRUE, remove_separators = TRUE,
                        remove_url = TRUE, remove_punct = TRUE)

More data cleaning

  • removing single letters that have no meaning on their own in English (frequently used in Twitter & blogs as abbreviations, such as “s”, “u”, etc.)

    caddie_tokens <- tokens_select(caddie_tokens, c("\\bw\\b","\\bs\\b", "\\bt\\b","\\bm\\b"), selection = "remove", valuetype = "regex")
    
    caddie_tokens <- tokens_select(caddie_tokens, c("\\bo\\b","\\be\\b","\\bn\\b","\\bd\\b"), selection = "remove", valuetype = "regex")

STOPWORDS

Stopwords are common words that are present in the text but generally do not contribute to the meaning of a sentence. They hold almost no importance for the purposes of information retrieval and natural language processing. They can safely be ignored without sacrificing the meaning of the sentence. For example – ‘the’ and ‘a’.
Stopwords are usually thought of as “the most common words in a language”.

However, for this project I don’t want to remove stopwords (i.e. I will not run caddie_tokens <- tokens_select(caddie_tokens, pattern = stopwords("en"), selection = "remove")), since many stopwords are needed for the text prediction model (3).

STEMMING

In Natural Language Processing (NLP), there may come a time when you want your program to recognize that the words “ask” and “asked” are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning. Maybe this is in an information retrieval setting and you want to boost your algorithm’s recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense related words so that you don’t have as much variability. Either way, this technique of text normalization may be useful to you.

Stemming is the process of reducing words to their word stems. A word stem need not be the same as a dictionary-based morphological root –it is simply an equal or smaller form of the word.

We will not stem our corpus for the time being –we’ll see later whether stemming reduces our processing time.
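If stemming does turn out to be worthwhile later, quanteda makes it a one-line operation (shown for illustration only, not applied to our tokens here):

### Illustration only -- reduce the tokens to their stems with the Snowball English stemmer
caddie_tokens_stemmed <- tokens_wordstem(caddie_tokens, language = "english")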

Statistical language models assign probabilities to sequences of words. The simplest model that assigns probabilities to sentences and sequences of words is the n-gram model.

You can think of an N-gram as a sequence of N words. So:

  • a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”,

  • a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework” (2)

So we need to know which n-grams exist in our source data. Using the tokens_ngrams() function of the QUANTEDA package, we create document-frequency matrices (DFMs) for unigrams up to 4-grams inclusive (a sketch of this step follows the list below).

The DFMs will be called:

  • caddie-tokens_DFM (unigrams)

  • caddie-bigrams_DFM (2-grams),

  • and so on
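A sketch of how these DFMs can be built with tokens_ngrams() and dfm() (the trigram and 4-gram object names below are my own naming choices, and the hyphens in the list above become underscores in actual R object names):

### Build document-frequency matrices for unigrams up to 4-grams
caddie_tokens_DFM    <- dfm(caddie_tokens)                        # unigrams
caddie_bigrams_DFM   <- dfm(tokens_ngrams(caddie_tokens, n = 2))  # 2-grams
caddie_trigrams_DFM  <- dfm(tokens_ngrams(caddie_tokens, n = 3))  # 3-grams
caddie_fourgrams_DFM <- dfm(tokens_ngrams(caddie_tokens, n = 4))  # 4-grams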

We then proceed to analyse the n-grams obtained.
We plot histograms of the 50 most frequent n-grams of each order:
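One way to extract and plot the top-50 frequencies is sketched below (using textstat_frequency() from the quanteda.textstats package and ggplot2; the figures that follow need not have been produced with this exact code):

### Sketch: bar chart of the 50 most frequent unigrams
library(quanteda.textstats)   # textstat_frequency()
library(ggplot2)

top_unigrams <- textstat_frequency(caddie_tokens_DFM, n = 50)

ggplot(top_unigrams, aes(x = reorder(feature, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(x = "Unigram", y = "Frequency", title = "50 most frequent unigrams")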

UNIGRAMS

BIGRAMS

TRIGRAMS

4-grams

From a review of the n-grams obtained we can see that all of them are meaningful (there are no n-grams within the 50 most frequent that do not have a meaning in English).

To evaluate and remove foreign words we could use the WordNet package, but we will not do so here: a review of the first 150 tokens obtained for unigrams through 4-grams shows very few foreign words.

How many unique words cover 50% of all word occurrences in the corpus?

### resumen_words is the unigram frequency table (columns feature, frequency),
### e.g. as produced by textstat_frequency() on the unigram DFM, sorted by descending frequency
words_coverage <- data.frame(
    coverage = round(cumsum(resumen_words$frequency) / sum(resumen_words$frequency) * 100, 2),
    words = 1:nrow(resumen_words)
)


library(dplyr)   # filter(), between(), %>%
words_coverage <- words_coverage %>% filter(between(coverage, 49, 51))
### This comes to approx. 125 words in the top tier (less than 0.1 % of unique words)

fifty_percent <- mean(words_coverage$words)

fifty_percent_cover <- (fifty_percent / length(resumen_words$feature)) * 100
## [1] "The total # of unique words (tokens) in our text corpus is 168472 words"
## [1] "50 % of word counts appearing in the corpus is covered with 0.08 % of the total Nr. of unique words"

So our source text corpus is quite concentrated in terms of coverage: a small # of words (less than 0.1%) covers more than 50% of the word occurrences in it. This means we could drop many “single-occurrence” words while building our model, making it run faster –since it will have to search through fewer N-grams.

NLP is a resource-intensive process, so we will have to use a rather small subset of the actual data to build the corpus for the text prediction model. As we have seen, around 5% of the text corpus seems adequate.

We then use this subset of the corpus to build n-grams via tokenization with the QUANTEDA package. At the end of the tokenization process we have bigrams, 3-grams and 4-grams. We will use the n-gram tables we create as input for an N-gram model.

In our model, we will have to deal with the user typing words that are in our “n-gram/token” vocabulary but appear in an unseen context (for example, after a word they never followed in training) –so, in a way, they are “unknown words” for our model.

To keep a language model from assigning ZERO (0) probability to these unseen events, we have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen. This modification is called smoothing or discounting. There are several ways to do smoothing: add-1 smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing(1).

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts.

Discounting can help solve the problem of zero-frequency N-grams. But there is an additional source of knowledge we can draw on. If we are trying to compute the probability P(wn | wn-2 wn-1) but we have no examples of the particular trigram wn-2 wn-1 wn, we can instead estimate its probability using the bigram probability P(wn | wn-1).

Similarly, if we don’t have counts to compute P(wn | wn-1) we can look to the unigram P(wn).

In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about. In backoff, we use the trigram if the evidence is sufficient, otherwise the bigram, otherwise the unigram. That is, we only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram (1).

The approach described above is the Katz backoff algorithm, which I will use to deal with unknown words or unseen sequences of N-grams.
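A minimal sketch of the backoff lookup logic is shown below. The table names and columns (prefix, word) are assumptions for illustration, the tables are assumed to be sorted by descending frequency, and the sketch omits the discounting of probability mass that full Katz backoff requires:

### Illustrative backoff lookup: try the trigram table first, then back off
### to bigrams and finally to the most frequent unigram.
### Assumes data.tables trigram_dt, bigram_dt, unigram_dt with columns
### "prefix" (the preceding word(s)) and "word" (the predicted next word),
### each sorted by descending frequency.
library(data.table)

predict_next <- function(phrase, trigram_dt, bigram_dt, unigram_dt) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

    ### try the trigram table (two-word prefix)
    if (length(words) == 2) {
        hit <- trigram_dt[prefix == paste(words, collapse = " ")]
        if (nrow(hit) > 0) return(hit$word[1])
    }

    ### back off to the bigram table (one-word prefix)
    hit <- bigram_dt[prefix == tail(words, 1)]
    if (nrow(hit) > 0) return(hit$word[1])

    ### back off to the most frequent unigram
    unigram_dt$word[1]
}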

REFERENCES

1.- Speech and Language Processing. Daniel Jurafsky & James H. Martin. 3rd draft edition, 16/Oct/19.
2.- Language Models: N-Gram. https://towardsdatascience.com/introduction-to-language-models-n-gram-e323081503d9
3.- Stopwords - Important for language, not so much for NLP - https://www.linkedin.com/pulse/stopwords-important-language-so-nlp-sunakshi-mamgain