Synopsis:

The last project within the Johns Hopkins Data Science Specialization (the capstone course) is a simulation of the real-world environment in which a data science specialist develops a brand new project.
The concrete goal of this one is to build a data product capable of predicting the next word entered on a mobile device, based on the previously typed words, using machine learning techniques applied to the field of Natural Language Processing.
This report summarizes the first activities carried out: getting, cleaning and understanding the data, exploratory data analysis, and an initial approach towards modeling the application. It is oriented to a non-specialist reader, which is why it has been recommended to keep the code and technical considerations in this presentation to a minimum, whilst showing knowledge of the data characteristics and summarizing them in tables and plots.
The data we work on has been provided by SwiftKey, and this report is based on the English-language corpus (although I personally missed a Spanish version).

Main steps of the data study:

- Download the data and save it in local files

The data was downloaded from the URL given in the documentation of the project, unpacked, and saved in the directory “./Datos” of the working environment. There are three files there: one contains blogs, the second news items, and the third tweets.
The files were read line by line, extracting basic information from them along the way. It is important to use the appropriate parameters in the reading instructions so as not to be interrupted by alien characters (present in three places of the news file), to skip nulls, and to format strange characters adequately for future processing (making the Unicode codes of non-textual characters explicit).
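
As an illustration, here is a minimal sketch of this reading step (the file name and the exact substitution applied to non-textual characters are assumptions):

con <- file("./Datos/en_US.blogs.txt", open = "rb")                # binary mode avoids a premature end-of-file on stray control characters
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)        # read line by line, skipping embedded nulls
close(con)
blogs <- iconv(blogs, from = "UTF-8", to = "ASCII", sub = "byte")  # make the codes of non-textual characters explicit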

- Extract a random sample of each one of the three files

The data, as read, has the following characteristics:

  • Original Data:

             Size (bytes)  # of lines  Longest line  # of words  Mean words/line
    Blogs     210,160,014     899,288        40,835  38,171,210         42.44604
    News      205,811,889   1,010,242        11,384  34,797,994         34.44521
    Twitter   167,105,338   2,360,148           213  30,657,971         12.98985

Since the total size of the original files (583,077,241 bytes) is quite big for a PC’s memory and speed, and since, in accordance with statistical inference, a reduced-size random sample is enough to get quite acceptable results, a process was implemented to extract approximately a random 10% of the lines of the given data. These sampled files - we are going to call each of them by the name of the original from which it was extracted - are described in the following table:

  • Sampled Data:

             Size (bytes)  # of lines  Longest line  # of words  Mean words/line
    Blogs      20,759,205      90,210        10,785   3,786,070         41.96952
    News       20,323,363     100,633         1,914   3,444,766         34.23098
    Twitter    16,499,901     235,626           764   3,068,024         13.02074

It can be seen that the size in bytes, the number of lines, and the number of words of each file have been reduced to approximately 10% of the original. On the other hand, the mean number of words per line in the News and Twitter samples is quite close to the original figure, as it should be. However, it is not so close in the Blogs file.
Furthermore, the maximum line length in the Twitter file is rather anomalous: while in the original file it is 213 bytes (remember that tweets in general could not exceed 140 characters until last year, although any URL counts as only 22 characters and there are other technical considerations in that count), in the sampled file it has grown to 764 bytes. How is this possible?
The reason for these anomalies is that the sampled files were read making the Unicode codes of non-textual characters explicit. For example, as we will see in some examples below, three non-English characters may be replaced by a string of 28 bytes. These characters are absent from the News file (more formal), while there are some in the Blogs file and many more in the Twitter file. The replacement of some characters by their much longer codes explains these observations.
Nevertheless, this code-explicit way of reading the samples gives us better control over the content and an easier way to eliminate those characters from the future predictor list.

The blogs are the documents containing the most words, while they also have the fewest lines. But “lines” in this context should be understood as paragraphs for the blogs and news, and as tweets in the Twitter case. So the blogs have fewer paragraphs than the news, but they are longer. And the tweets are far more numerous than the paragraphs in the other two types of documents, but the mean number of words per tweet is quite small compared with the same statistic per paragraph of blogs and news.
This analysis agrees with what we know about these kinds of documents and makes sense, thus supporting the quality of the data and of the reading and sampling performed.
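
For reference, a minimal sketch of the ~10% sampling step described above (the object names, probability and seed are assumptions):

set.seed(1234)                                              # reproducible sample
keep <- rbinom(length(blogs), size = 1, prob = 0.10) == 1   # keep each line independently with probability 0.10
blogs_sample <- blogs[keep]
writeLines(blogs_sample, "./Datos/sample_blogs.txt")        # hypothetical name for the sampled file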

- Load the samples into a working corpus

Using a text mining package (tm), the three sampled files were assembled into a corpus, that is, a collection of documents to be preprocessed, in general, with a common criterion. Since the predictive application we are planning is not oriented to a specialized type of user or form of expression, but to common-use English, we have chosen to merge the three sampled files into a single collection. However, if necessary, we can still access lines at specific locations in any of the three components of the corpus.
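
A minimal sketch of this step (the object names are assumptions):

library(tm)
docs <- c(blogs_sample, news_sample, twitter_sample)   # merge the three sampled files
corpus <- VCorpus(VectorSource(docs))                  # in-memory corpus built from the combined lines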

- Perform transformations on the corpus oriented to the purpose of the work.

Once the corpus is built, it is transformed to eliminate profanity, suppress unnecessary white space, remove punctuation and numbers, and remove the Unicode codes of non-textual characters - adequately formatted during the initial reading - since none of these are meant to be predicted by the working application. Also, all letters are changed to lower case. On the other hand, we decided not to remove stopwords, since these may be important in a predictive system (consider, for example, articles and prepositions).

All of these facilities, provided by the tm package, act on the documents of the corpus as a whole.
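A minimal sketch of these transformations, where the order and the regular expression used for the explicit character codes are assumptions (profanity removal is discussed separately below):

corpus <- tm_map(corpus, content_transformer(tolower))                              # lower case
corpus <- tm_map(corpus, removePunctuation)                                         # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                                             # remove numbers
corpus <- tm_map(corpus, content_transformer(function(x) gsub("<[^>]+>", " ", x)))  # drop explicit <U+xxxx>/<f0> codes
corpus <- tm_map(corpus, stripWhitespace)                                           # suppress extra white space
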
Let’s see the effects of this transformation on some lines of the sampled set, as they were initially, and in their post-transformed state:

Twitter line 4072.- Before tm_map transform: 
I have the best sistaa <f0><U+009F><U+0098><U+0098><U+0093>: Making home made cookies for my sissy! <U+0094>
After tm_map transform: 
i have the best sistaa  making home made cookies for my sissy 
Longest Twitter line.- Before tm_map transform: 
 <f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>
After tm_map transform: 
  cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake 

The profanity elimination part is handled in tm with a special call to tm_map, using the provided removeWords function. Beforehand, it is necessary to build or download a list of undesirable words, which we call badwords. In our particular case, these are words we do not want to see suggested as the next input by the project’s predictive application.

The removeWords function removes words from a text document, and for our corpus we should call it as: corpus <- tm_map(corpus, removeWords, badwords[,1])
However, its performance was not satisfactory for this project, since it left many instances of the banned words unchanged. As an example, using two four-letter words as our list, we found 4555 appearances in the sampled corpus. Of these, removeWords took off only 1842 (40.4%).
Let’s look at this in some detail, since trying to fix this problem took a good deal of our time.
I wrote a very simple substitute for removeWords, called noBadwords, and used it in this way:

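# Replace every match of each bad word with a space (note: substring matches are replaced too)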
noBadwords <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
for (bw in badwords[,1]){ 
    corpus <- tm_map(corpus, noBadwords, bw)
}

This worked better than removeWords, but it wasn’t perfect either: it removed 3334 (73.2%) of the bad-word instances, but left 1221.
Any attempt to improve this by using more complex patterns than the word itself was unsuccessful.
Let’s see, as an example, what happened through the whole transformation process to a repeated instance of a rather innocuous word from the list: damn.

Blogs line 5169: 
 Damnit I heard her whisper. Damnitdamnitdamnitdamnitdamnit. I can't deal with this. Not right now.
After tm_map transform:
 damnit i heard her whisper damnit it it it it i can't deal with this not right now

Distribution of words and relationships between words.

Through exploratory analysis of our data, let’s try to find out the distribution of words and the relationships between the words of the corpus. We’ll use the RWeka package, which has special facilities for building sets of word associations, and compute from the corpus the frequency corresponding to each word, each pair of words, and each triplet of words. These data will be needed later by our model, which approximates the probabilities of sequences of words via Markov chains.
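A minimal sketch of the bigram frequency computation (the tokenizer settings and object names are assumptions; the same pattern applies to single words, trigrams and quadgrams):

library(RWeka)
bigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))   # split text into 2-word sequences
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))     # bigram counts per document
bigram_freq <- sort(slam::row_sums(tdm2), decreasing = TRUE)                  # total frequency of each bigram
head(bigram_freq, 12)                                                         # the dozen most frequent bigrams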
But these computations require a lot of RAM, which depends on the size of the corpus. The 426,469 lines of our built corpus caused a “garbage collect overhead limit exceeded” error on the 4 GB RAM laptop used.
On the other hand, the results are not deeply affected by a reduction in that size. So we sampled the previous corpus again, reducing it to 25% of its size, and obtained the following barplots, which show the dozen most frequent items of each type of association (one, two, three and four words), sorted in descending order by frequency:

Another important fact to learn is the coverage of the corpus by N-grams having a minimum or maximum frequency. We want to know, for example, what percentage of the corpus is covered if we take the words with a frequency not smaller than 50, or not greater than 10.
The following plots, computed for percentages with a granularity of 0.5, address this requirement for words, bigrams, trigrams and quadgrams.
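
A minimal sketch of the underlying coverage computation (word_freq is an assumed named vector of word frequencies):

coverage_at_least <- function(freq, min_freq) {
    sum(freq[freq >= min_freq]) / sum(freq)   # share of all occurrences covered by items seen at least min_freq times
}
coverage_at_least(word_freq, 50)              # e.g., fraction of the corpus covered by words appearing 50 times or more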

Plans for a prediction algorithm and Shiny app.

The process described has given us, at this point, a corpus and, more importantly, a database of its N-grams and their frequencies, which will be the principal tools upon which to build the predictive language application.
But building the real probabilistic model is a complex matter, and at this stage we can only imagine a path towards the final product.
According to the reading material, the forum threads, and what we have experienced during the exploratory analysis, simple maximum likelihood (ML) estimates are not enough because of data sparsity. A smoothing method becomes necessary, and there are several to choose from.
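For reference, a minimal sketch of the plain ML estimate for a trigram, which illustrates the sparsity problem: any unseen trigram gets probability zero (trigram_freq and bigram_freq are assumed named frequency vectors):

ml_prob <- function(w1, w2, w3, trigram_freq, bigram_freq) {
    tri <- trigram_freq[paste(w1, w2, w3)]                # count of "w1 w2 w3"
    bi <- bigram_freq[paste(w1, w2)]                      # count of "w1 w2"
    if (is.na(tri) || is.na(bi) || bi == 0) return(0)     # unseen history or trigram: probability 0 without smoothing
    unname(tri / bi)
}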
As far as I can see now, considering that our database is not large, modified Kneser-Ney smoothing (an interpolation algorithm) seems to be a good choice. It will depend on how feasible it is to implement in view of the limitations of time, the hardware, and the speed and accuracy goals the application must fulfill.
As for the Shiny implementation, it will probably be a simulation of a mobile keyboard to enter the words, and a screen that keeps showing the next-word prediction. The engine with the algorithm will reside in the server function, while the ui function will handle all of the graphical interface. The performance of the system will depend in part on the shinyapps.io server.
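
A minimal sketch of the planned layout (predict_next_word is a hypothetical engine still to be written):

library(shiny)
ui <- fluidPage(
    textInput("phrase", "Type your phrase:"),   # simulated keyboard input
    textOutput("prediction")                    # next-word prediction shown as the user types
)
server <- function(input, output) {
    output$prediction <- renderText({
        predict_next_word(input$phrase)         # hypothetical prediction engine
    })
}
shinyApp(ui = ui, server = server)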