Overview

We aim to develop a machine learning algorithm that predicts the next word given a word or phrase. To achieve this goal, the first step is to build a language model by ingesting a corpus of documents. These documents are used to understand the distribution of words and how they are put together. For this reason, this exploratory data analysis focuses on extracting n-grams, i.e., sequences of n consecutive words, from the corpus.

We first load the data and provide a basic summary of the raw dataset. To speed up the analysis, we generate a random sample of the original dataset. This representative subset of the data is then cleaned: more specifically, we filter out profanity and entries we do not want to predict, such as punctuation. We then move to the analysis of the data and extract n-grams to understand the frequency of single words and sequences of words encountered in the corpus. In the last section, we lay out plans for building the learning algorithm and developing the associated Shiny application.

Dataset

Three text files are available in the English corpus. Summary statistics are given below.

dir <- "../../final/en_US/"
files <- paste0(dir, list.files(dir) )
nFiles <- length(files)

info <- function(fileName){
    size <- file.info(fileName)$size / 1024^2 # size of the file in MB
    
    wc <- unlist(strsplit(system2("wc", args = paste("-lw", fileName), stdout = TRUE), " ") )
    nLines <- (wc[wc != ""])[1] # number of lines in the file
    nWords <- (wc[wc != ""])[2] # number of words in the file
    
    df <- data.frame(as.numeric(size), as.numeric(nLines), as.numeric(nWords) )
    colnames(df) <- c("size", "nLines", "nWords")
    rownames(df) <- sub("^.*/", "", fileName)
    
    return(df)
}

for (i in seq(nFiles) ) {
    if (i == 1) files.info <- info(files[i])
    else files.info = rbind(files.info, info(files[i]) )
}

library("knitr")
kable(files.info, format.args = list(big.mark = ","), digits = 0, col.names = c("Size (MB)", "Lines", "Words") )
                    Size (MB)       Lines        Words
en_US.blogs.txt           200     899,288   37,334,690
en_US.news.txt            196   1,010,242   34,372,720
en_US.twitter.txt         159   2,360,148   30,374,206

The files are quite large. For this reason, we will only consider a small sample of them in this analysis.

Preprocessing

Sampling

We randomly sample 5,000 entries from every document available and save the extracted lines in a file on disk.

set.seed(128) # for reproducibility
nSample <- 5000
sampleFile <- function(fileName) {
    keep <- sample(info(fileName)$nLines, nSample)
    file <- readLines(fileName, skipNul = TRUE, encoding = "UTF-8")
    sample <- file[keep]

    return(sample)
}

for (i in seq(nFiles) ) {
    if (i == 1) selected <- sampleFile(files[i])
    else selected <- c(selected, sampleFile(files[i]) ) # character vectors are combined with c(), not rbind()
}

writeLines(selected, con = "sample.txt")
sample.info <- info("sample.txt")

This file is only 2 MB and contains 15,000 lines and 439,363 words.
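
These figures can be displayed with the same kable helper used above; the chunk below is only a small sketch reusing the sample.info object built from our info() function.

kable(sample.info, format.args = list(big.mark = ","), digits = 0,
      col.names = c("Size (MB)", "Lines", "Words") )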

Cleaning

Let’s start by lowercasing all characters.

clean <- tolower(selected)

We now remove URLs, email addresses, hashtags and Twitter usernames.

clean <- gsub("http\\S+\\s*", "", clean); clean <- gsub("www\\S+\\s*", "", clean)   # urls
clean <- gsub('\\S+@\\S+', "", clean)                                               # emails
clean <- gsub('#\\S+', "", clean); clean <- gsub('@\\S+', "", clean)                # twitter

In the next chunk, we remove profanity and offensive words. The list of bad words was downloaded from this URL.

badwords <- readLines("badwords.txt", skipNul = TRUE, encoding = "UTF-8")
# match whole words only, so that clean words containing a bad word as a substring are preserved
for (word in badwords) clean <- gsub(paste0("\\b", word, "\\b"), "", clean)

Finally, we remove numbers and punctuation. Multiple and trailing spaces also need to be collapsed or removed. This can easily be achieved using the ngram package. Functions from this package operate on a single string, so we first concatenate the corpus.

library("ngram")
words <- concatenate(clean)
words <- preprocess(words, remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)

Note that we did not remove stopwords (words such as the, also, and, …) from the corpus. Although these words carry little meaning on their own, they are extremely frequent and we believe it is important to keep them in this project. Our model will rely heavily on the n-grams, i.e., ordered sequences of n words, extracted from the corpus. A vast majority of n-grams would be meaningless if stopwords were removed and, in turn, the model would perform poorly.
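
As a quick illustration of this point, the sketch below compares the trigrams of a short phrase with and without its stopwords, using the ngram package already loaded above. The example sentence is ours and is not taken from the corpus.

# illustrative example: trigrams with and without stopwords
example <- "one of the best things in the world"
get.ngrams(ngram(example, n = 3) )                 # "one of the", "of the best", ...
get.ngrams(ngram("one best things world", n = 3) ) # only "one best things", "best things world"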

Analysis

The next step consists of retrieving n-grams from the cleaned corpus. The ngram package makes this easy. In the next chunk, we also write some useful functions for performing the n-gram analysis.

getNGram <- function(words, n, print = TRUE) {
    ngram <- ngram(words, n = n)
    df <- get.phrasetable(ngram)
    if (print == TRUE) {
        print(kable(head(df, n = 5), format.args = list(big.mark = ","), digits = 5,
              col.names = c(paste0(n, "-gram"), "Frequency", "Proportion") ) )
    }
    return(df)
}

library("RColorBrewer")
library("wordcloud")
getNGramWordCloud <- function(ngram) {
    wordcloud(ngram$ngrams, ngram$freq, max.words = 50, min.freq = 5, scale = c(2.5, .5),
              colors = brewer.pal(6, "Dark2") )
}

library("ggplot2")
getNGramFreq <- function(ngram, n = 15) {
    title <- paste0(n, " most frequent ", length(strsplit(ngram[1,1], " ")[[1]]), "-gram.")
    ggplot(ngram[1:n,], aes(x = reorder(ngrams, freq), y = freq) ) + geom_bar(stat = "identity") +
        geom_text(aes(label = sprintf("%1.2f%%", 100*prop) ), hjust = 1.25, colour = "white", size = 3) + 
        coord_flip() + labs(title = title, x = "", y = "Count")
}

We can now produce the 1-, 2- and 3-grams and give some basic statistics. Word clouds and bar plots are shown along with the tables.

Unigram

unigram <- getNGram(words, 1, print = TRUE)
1-gram   Frequency   Proportion
the         21,931      0.05149
to          12,154      0.02853
and         11,125      0.02612
a           10,551      0.02477
of           9,375      0.02201
getNGramFreq(unigram)
getNGramWordCloud(unigram)

It is worth noting that the 10 most frequent unigrams represent 22% of all the unigrams extracted.
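
This figure can be checked directly from the phrase table returned by get.phrasetable(), whose prop column holds each unigram's proportion; the line below is a quick sketch of that check.

# cumulative proportion covered by the 10 most frequent unigrams
sum(head(unigram$prop, 10) )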

Bigram

bigram  <- getNGram(words, 2, print = TRUE)
2-gram   Frequency   Proportion
of the       2,027      0.00476
in the       1,957      0.00459
to the       1,010      0.00237
on the         922      0.00216
for the        811      0.00190
getNGramFreq(bigram)
getNGramWordCloud(bigram)

Trigram

trigram <- getNGram(words, 3, print = TRUE)
3-gram       Frequency   Proportion
one of the         147      0.00035
a lot of           139      0.00033
to be a             92      0.00022
as well as          78      0.00018
out of the          73      0.00017
getNGramFreq(trigram)
getNGramWordCloud(trigram)

Next steps

The very next step is to develop an algorithm to predict the next word in a sequence. N-gram models allow us to assign probabilities to sequences of words. Using the n-grams that have been extracted from the corpus, we can easily estimate the probability of the last word of an n-gram given the previous words. Markov chains are an easy way to store and query these n-gram probabilities. Several questions still need to be considered when building this first model of the relationship between words.
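
As a rough illustration of this estimation step, the sketch below uses the unigram and bigram phrase tables computed above to rank the words most likely to follow a given word. The predictNext helper is purely illustrative and is not the final model.

# sketch only: rank candidate next words from the bigram counts
predictNext <- function(word, k = 3) {
    word <- tolower(word)
    # keep bigrams whose first word matches the input word
    firstWords <- sapply(strsplit(trimws(as.character(bigram$ngrams)), " "), `[`, 1)
    candidates <- bigram[firstWords == word, ]
    # P(next | word) is estimated by count(word, next) / count(word)
    countWord <- unigram$freq[trimws(as.character(unigram$ngrams)) == word]
    candidates$prob <- candidates$freq / countWord
    nextWords <- sapply(strsplit(trimws(as.character(candidates$ngrams)), " "), `[`, 2)
    head(data.frame(next.word = nextWords, prob = candidates$prob), k)
}

predictNext("of") # most likely words following "of", e.g. "the"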

Once we have an accurate and efficient predictive model, we will develop a Shiny web application. The user will be asked to enter some text, and the application will then suggest the words that most likely follow what the user has typed in.
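
A minimal sketch of such an application is given below, assuming a helper like the illustrative predictNext above; the layout and widget names are only indicative.

library("shiny")

ui <- fluidPage(
    titlePanel("Next word prediction"),
    textInput("phrase", "Type some text:"),
    tableOutput("suggestions")
)

server <- function(input, output) {
    output$suggestions <- renderTable({
        req(input$phrase)
        # use the last word typed so far to rank candidate next words (sketch only)
        lastWord <- tail(strsplit(tolower(input$phrase), "\\s+")[[1]], 1)
        predictNext(lastWord)
    })
}

shinyApp(ui = ui, server = server)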