Overview

We aim to develop a machine learning algorithm that predicts the next word given a word or phrase. To achieve this goal, the first step is to build a language model by ingesting a corpus of documents. These documents are used to understand the distribution of words and how they are put together. For this reason, this exploratory data analysis focuses on extracting n-grams, i.e., sequences of n consecutive words, from the corpus.

We first load the data and provide a basic summary of the raw dataset. To speed up the analysis, we generate a random sample of the original dataset. This representative subset of the data is then cleaned: more specifically, we filter out profanity and entries we do not want to predict, such as punctuation. We then move to the analysis of the data and extract n-grams to understand the frequency of single words and sequences of words encountered in the corpus. In the last section, we lay out plans for building the learning algorithm and developing the associated Shiny application.

Dataset

Three text files are available in the English corpus. Summary statistics are given below.

dir <- "../../final/en_US/"
files <- paste0(dir, list.files(dir) )
nFiles <- length(files)

info <- function(fileName){
    size <- file.info(fileName)$size / 1024^2 # size of the file in MB
    
    wc <- unlist(strsplit(system2("wc", args = paste("-lw", fileName), stdout = TRUE), " ") )
    nLines <- (wc[wc != ""])[1] # number of lines in the file
    nWords <- (wc[wc != ""])[2] # number of words in the file
    
    df <- data.frame(as.numeric(size), as.numeric(nLines), as.numeric(nWords) )
    colnames(df) <- c("size", "nLines", "nWords")
    rownames(df) <- sub("^.*/", "", fileName)
    
    return(df)
}

for (i in seq(nFiles) ) {
    if (i == 1) files.info <- info(files[i])
    else files.info = rbind(files.info, info(files[i]) )
}

library("knitr")
kable(files.info, format.args = list(big.mark = ","), digits = 0, col.names = c("Size (MB)", "Lines", "Words") )
                    Size (MB)       Lines        Words
en_US.blogs.txt           200     899,288   37,334,690
en_US.news.txt            196   1,010,242   34,372,720
en_US.twitter.txt         159   2,360,148   30,374,206

The files are quite large. For this reason, we will only consider a small sample of them in this analysis.

Preprocessing

Sampling

We randomly sample 5,000 entries from every document available and save the extracted lines in a file on disk.

set.seed(128) # for reproducibility
nSample <- 5000
sampleFile <- function(fileName) {
    keep <- sample(info(fileName)$nLines, nSample)
    file <- readLines(fileName, skipNul = TRUE, encoding = "UTF-8")
    sample <- file[keep]

    return(sample)
}

for (i in seq(nFiles) ) {
    if (i == 1) selected <- sampleFile(files[i])
    else selected <- c(selected, sampleFile(files[i]) ) # character vectors are combined with c(), not rbind()
}

writeLines(selected, con = "sample.txt")
sample.info <- info("sample.txt")

This file is only 2 MB and contains 15,000 lines and 439,363 words.
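
These figures can be displayed with the same kable helper used above; the chunk below is only a small sketch reusing the sample.info object built from our info() function.

kable(sample.info, format.args = list(big.mark = ","), digits = 0,
      col.names = c("Size (MB)", "Lines", "Words") )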

Cleaning

Let’s start by lowercasing all characters.

clean <- tolower(selected)

We now remove URLs, email addresses, hashtags and Twitter usernames.

clean <- gsub("http\\S+\\s*", "", clean); clean <- gsub("www\\S+\\s*", "", clean)   # urls
clean <- gsub('\\S+@\\S+', "", clean)                                               # emails
clean <- gsub('#\\S+', "", clean); clean <- gsub('@\\S+', "", clean)                # twitter

In the next chunk, we remove profanity and offensive words. The list of bad words was downloaded from this URL.

badwords <- readLines("badwords.txt", skipNul = TRUE, encoding = "UTF-8")
# match whole words only, so that clean words containing a bad word as a substring are preserved
for (word in badwords) clean <- gsub(paste0("\\b", word, "\\b"), "", clean)

Finally, we remove numbers and punctuation. Multiple and trailing spaces also need to be collapsed or removed. This can easily be achieved using the ngram package. Functions from this package operate on a single string, so we first concatenate the corpus.

library("ngram")
words <- concatenate(clean)
words <- preprocess(words, remove.punct = TRUE, remove.numbers = TRUE, fix.spacing = TRUE)

Note that we did not remove stopwords (words such as the, also, and, …) from the corpus. Although these words carry little meaning on their own, they are extremely frequent and we believe it is important to keep them in this project. Our model will rely heavily on the n-grams, i.e., ordered sequences of n words, extracted from the corpus. A vast majority of n-grams would be meaningless if stopwords were removed and, in turn, the model would perform poorly.
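
As a quick illustration of this point, the sketch below compares the trigrams of a short phrase with and without its stopwords, using the ngram package already loaded above. The example sentence is ours and is not taken from the corpus.

# illustrative example: trigrams with and without stopwords
example <- "one of the best things in the world"
get.ngrams(ngram(example, n = 3) )                 # "one of the", "of the best", ...
get.ngrams(ngram("one best things world", n = 3) ) # only "one best things", "best things world"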

Analysis

The next step consists of retrieving n-grams from the cleaned corpus. The ngram package makes this easy. In the next chunk, we also write some useful functions for performing the n-gram analysis.

getNGram <- function(words, n, print = TRUE) {
    ngram <- ngram(words, n = n)
    df <- get.phrasetable(ngram)
    if (print == TRUE) {
        print(kable(head(df, n = 5), format.args = list(big.mark = ","), digits = 5,
              col.names = c(paste0(n, "-gram"), "Frequency", "Proportion") ) )
    }
    return(df)
}

library("RColorBrewer")
library("wordcloud")
getNGramWordCloud <- function(ngram) {
    wordcloud(ngram$ngrams, ngram$freq, max.words = 50, min.freq = 5, scale = c(2.5, .5),
              colors = brewer.pal(6, "Dark2") )
}

library("ggplot2")
getNGramFreq <- function(ngram, n = 15) {
    title <- paste0(n, " most frequent ", length(strsplit(ngram[1,1], " ")[[1]]), "-gram.")
    ggplot(ngram[1:n,], aes(x = reorder(ngrams, freq), y = freq) ) + geom_bar(stat = "identity") +
        geom_text(aes(label = sprintf("%1.2f%%", 100*prop) ), hjust = 1.25, colour = "white", size = 3) + 
        coord_flip() + labs(title = title, x = "", y = "Count")
}

We can now produce the 1-, 2- and 3-grams and give some basic statistics. Word clouds and bar plots are shown along with the tables.

Unigram

unigram <- getNGram(words, 1, print = TRUE)
1-gram   Frequency   Proportion
the         21,931      0.05149
to          12,154      0.02853
and         11,125      0.02612
a           10,551      0.02477
of           9,375      0.02201
getNGramFreq(unigram)
getNGramWordCloud(unigram)

It is worth noting that the 10 most frequent unigrams represent 22% of all the unigrams extracted.
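
This figure can be checked directly from the phrase table returned by get.phrasetable(), whose prop column holds each unigram's proportion; the line below is a quick sketch of that check.

# cumulative proportion covered by the 10 most frequent unigrams
sum(head(unigram$prop, 10) )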

Bigram

bigram  <- getNGram(words, 2, print = TRUE)
2-gram   Frequency   Proportion
of the       2,027      0.00476
in the       1,957      0.00459
to the       1,010      0.00237
on the         922      0.00216
for the        811      0.00190
getNGramFreq(bigram)
getNGramWordCloud(bigram)

Trigram

trigram <- getNGram(words, 3, print = TRUE)
3-gram       Frequency   Proportion
one of the         147      0.00035
a lot of           139      0.00033
to be a             92      0.00022
as well as          78      0.00018
out of the          73      0.00017
getNGramFreq(trigram)
getNGramWordCloud(trigram)

Next steps

The very next step is to develop an algorithm to predict the next word in a sequence. N-gram models allow us to assign probabilities to sequences of words. Using the n-grams that have been extracted from the corpus, we can easily estimate the probability of the last word of an n-gram given the previous words. Markov chains are an easy way to store and query these n-gram probabilities. Several questions still need to be considered when building this first model of the relationship between words.
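
As a rough illustration of this estimation step, the sketch below uses the unigram and bigram phrase tables computed above to rank the words most likely to follow a given word. The predictNext helper is purely illustrative and is not the final model.

# sketch only: rank candidate next words from the bigram counts
predictNext <- function(word, k = 3) {
    word <- tolower(word)
    # keep bigrams whose first word matches the input word
    firstWords <- sapply(strsplit(trimws(as.character(bigram$ngrams)), " "), `[`, 1)
    candidates <- bigram[firstWords == word, ]
    # P(next | word) is estimated by count(word, next) / count(word)
    countWord <- unigram$freq[trimws(as.character(unigram$ngrams)) == word]
    candidates$prob <- candidates$freq / countWord
    nextWords <- sapply(strsplit(trimws(as.character(candidates$ngrams)), " "), `[`, 2)
    head(data.frame(next.word = nextWords, prob = candidates$prob), k)
}

predictNext("of") # most likely words following "of", e.g. "the"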

Once we have an accurate and efficient predictive model, we will develop a Shiny web application. The user will be asked to enter some text, and the application will then suggest the words that most likely follow what the user has typed in.
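
A minimal sketch of such an application is given below, assuming a helper like the illustrative predictNext above; the layout and widget names are only indicative.

library("shiny")

ui <- fluidPage(
    titlePanel("Next word prediction"),
    textInput("phrase", "Type some text:"),
    tableOutput("suggestions")
)

server <- function(input, output) {
    output$suggestions <- renderTable({
        req(input$phrase)
        # use the last word typed so far to rank candidate next words (sketch only)
        lastWord <- tail(strsplit(tolower(input$phrase), "\\s+")[[1]], 1)
        predictNext(lastWord)
    })
}

shinyApp(ui = ui, server = server)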