Summary.

The main goal of this project is to use R to build an application that, given an input text, predicts the next word.

This is a non-technical summary, intended for a general audience.

Obtaining the data.

To predict anything we always need two basic ingredients: data and a prediction model. In this report we are going to concentrate on the analysis of the data, leaving the modeling part for a later stage of the project. We will only include some comments about the modeling strategy at the end of this report.

The Capstone Project Dataset is downloaded as a zip file. It is a big file, over 500 MB, containing three large files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

For further information about these files, please see http://www.corpora.heliohost.org/

Data Files Exploration.

The text files in the data are intended to be a representative sample of common English written text in the media. We are going to analyze the structure of the sentences in these files, in order to build our models upon the results of the analysis. To familiarize ourselves with the data, here is some basic information about the files.

The file sizes (in MBytes) are:

##    Blog file    News file Twitter file 
##        200.4        196.3        159.4

The numbers of text lines are:

##    Blog file    News file Twitter file 
##       899288      1010242      2360148

The text lines in the Twitter file are of course limited in size. But in both the blogs and the news files, the lines of text can be very long. The maximum lengths of the lines (in characters) in the text files are the following:

##    Blog file    News file Twitter file 
##        40833        11384          140

A more representative idea of the length of the lines is however provided by the median length of the lines in the files:

##    Blog file    News file Twitter file 
##          156          185           64
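The figures above can be obtained with a few lines of base R. The following is only a minimal sketch, not the code used for this report: the helper name file_stats is hypothetical, and a small temporary file stands in for the real corpus files.

```r
# Hypothetical stand-in for one of the corpus files; the real files would be
# read from wherever the zip archive was extracted.
path <- tempfile(fileext = ".txt")
writeLines(c("a short line",
             "a somewhat longer line of text",
             "mid length line"), path)

# Summarise one text file: size in MB, number of lines,
# and the longest and median line lengths (in characters).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  len   <- nchar(lines, type = "chars", allowNA = TRUE)
  c(size_mb    = file.size(path) / 2^20,
    n_lines    = length(lines),
    max_len    = max(len, na.rm = TRUE),
    median_len = median(len, na.rm = TRUE))
}

stats <- file_stats(path)
print(stats)
```

Applying the same function to each of the three files would give one column of each table above.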

Data sampling.

As we have just seen, the text files are very large. Using the whole data set would make the model construction too slow, and the resulting model too big to fit in memory on most computers (not to mention mobile devices). Therefore, we have selected a random sample of the data as a training data set on which to build the model. This also allows us to use the rest of the data as a test data set to measure the accuracy of our model. The sampling strategy consists of pooling all the text lines in the files and drawing a random subset of those lines.

The total number of text lines in the data is 4269678 (over 4 million), but we will be considering a much smaller sample, consisting of a given percentage of that total number of lines (0.5% of the number of lines, giving a sample size of $2.1348 \times 10^{4}$ lines of text). Sampling is random and with replacement to keep the sample representative of the whole data set.
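The sampling step can be sketched as follows. This is an illustration under stated assumptions rather than the report's actual code: the vectors blogs, news and twitter are small made-up stand-ins for the lines read from the three files, and the seed is arbitrary.

```r
set.seed(2016)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical stand-ins for the lines read from the three files.
blogs   <- paste("blog line", 1:4000)
news    <- paste("news line", 1:4000)
twitter <- paste("tweet", 1:12000)

all_lines   <- c(blogs, news, twitter)
sample_frac <- 0.005                                   # the 0.5% used in the text
n_sample    <- round(sample_frac * length(all_lines))

# Draw line indices at random, with replacement as described above.
idx          <- sample(length(all_lines), n_sample, replace = TRUE)
training_set <- all_lines[idx]

# Lines that were never drawn remain available as a test set.
test_set <- all_lines[-unique(idx)]
```

With the real data, round(0.005 * 4269678) gives the 21348 lines quoted above.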

Data cleaning.

These text files are useful because they represent normal English text, as found online. However, that also means that they include all the kinds of things that you expect from online texts, such as tweets and similar sources. We will have to deal with typos, nonsense text, foreign words, profanity, special structures such as URLs, etc. Therefore, our first task is cleaning the data to make it amenable to the model building part of the project. Of course, some of the cleaning decisions made below can and will be revised as the model is built, to assess their impact on the model performance.

Tokenization.

In Natural Language Processing (NLP), tokenization refers to the process of breaking a text up into its components (tokens), such as words. In this process, the notion of token depends on the goal of the analysis, and the data cleaning is an integral part of this tokenization process. For this exploratory part of the project we begin with a quite crude version of the tokens, in that we start by:

Removing punctuation.
Removing numbers.
Converting all data to lower case.

To tokenize the data we use the infrastructure provided by the tm package in R (see tm). Technically, the text data is converted into a data structure called a corpus to carry out the tokenization. This corpus is made of so-called documents, in this case one document for each line of text in the sampled data. For example, the content of the first document, prior to the tokenization process, is:

## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

As you can see, it contains punctuation, upper case letters, symbols such as $, etc. After carrying out the above steps the result is:

## [1] "chad has been awesome with the kids and holding down the fort while i work later than usual the kids have been busy together playing skylander on the xbox together after kyan cashed in his from his piggy bank he wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it he never taps into that thing either that is how we know he wanted it so bad we made him count all of his money to make sure that he had enough it was very cute to watch his reaction when he realized he did he also does a very good job of letting lola feel like she is playing too by letting her switch out the characters she loves it almost as much as him"
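The three steps listed above can be approximated in base R with regular expressions. The report itself relies on the tm package for this; the sketch below swaps in plain gsub calls so that it runs without extra packages, and the function name basic_clean is hypothetical.

```r
# Crude first-pass cleaning: punctuation and symbols, numbers, lower case.
basic_clean <- function(x) {
  x <- gsub("[^[:alnum:][:space:]]", "", x)  # drop punctuation and symbols such as $
  x <- gsub("[[:digit:]]", "", x)            # drop numbers
  tolower(x)                                 # convert to lower case
}

cleaned <- basic_clean("Can't wait: 2 Kids, 1 XBox!")
print(cleaned)
```

Note that removing characters leaves extra spaces behind, which is why whitespace has to be handled as a separate step later on.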

Handling profanity.

To deal with profanity (and in general with any collection of undesired words) we are going to use a precompiled list of words from the Luis von Ahn Research Group, which is available online at the following address:

http://www.cs.cmu.edu/~biglou/resources/bad-words.txt

The cleaning process will look for the lines in our sample data containing any of these words, and will remove the word from the text.

The list in this file can of course be replaced with any other suitable list of words to be removed from our data, but this one will do as proof of concept.
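A minimal sketch of this removal step is shown below. The two-word list is a harmless placeholder, not the actual content of the downloaded file, and remove_words is a hypothetical helper name. The sketch also assumes that the listed words contain no regular expression metacharacters.

```r
# Placeholder list; in practice it would be read from the downloaded file,
# for example with readLines().
bad_words <- c("darn", "heck")

# Remove every whole-word occurrence of the listed words.
remove_words <- function(x, words) {
  # \\b anchors restrict the match to whole words, so "heckle" is kept.
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

filtered <- remove_words("what the heck is that darn thing", bad_words)
print(filtered)
```

Because only the word is removed, not the whole line, the rest of the sentence stays available for the analysis.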

Further cleaning of the data.

Besides, we are going to perform some other cleaning operations on the sample data. The basic idea is to define some patterns (technically, regular expressions) that we wish to remove from our data.

How does that work? For example, it’s safe to say that any word containing four or more consecutive vowels can be removed from the data. Similarly, any word with six or more consecutive consonants may be removed (see the reference and the exceptions in http://www.fun-with-words.com/word_consecutive_letters.html#Consonant_Sequence). In a later phase of the analysis, further patterns can be identified, and the (expectedly small) impact of removing these patterns will be assessed.

Besides, we would like to remove the “non-English characters” (like the Chinese 坁, the Spanish ñ, the French ô, etc.).
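The patterns just described might be written as regular expressions along the following lines. This is a sketch under assumptions: the input is taken to be UTF-8 encoded and already lower case, the letter y is counted as a consonant, and remove_patterns is a hypothetical name.

```r
remove_patterns <- function(x) {
  # Drop characters outside ASCII (assumes UTF-8 encoded input).
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # Drop words with four or more consecutive vowels.
  x <- gsub("\\b[a-z]*[aeiou]{4,}[a-z]*\\b", "", x, perl = TRUE)
  # Drop words with six or more consecutive consonants (y counted as one).
  x <- gsub("\\b[a-z]*[b-df-hj-np-tv-z]{6,}[a-z]*\\b", "", x, perl = TRUE)
  x
}

stripped <- remove_patterns("aaaaah that was zzzzzz good")
print(stripped)
```

In this sketch non-ASCII characters are deleted individually, so the surrounding word is kept without them; removing the whole word instead would be an equally reasonable choice.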

Removing whitespace.

Finally, after all these operations, our sample lines of text will be left with quite a lot of whitespace, due in part to the pieces that have been removed. Another important cleaning operation consists of removing whitespace from the beginning or the end of a sentence, because those spaces can interfere with some parts of the analysis (e.g., with the count of the number of words in a sentence).

An additional step after removing all the above from the data is to ensure that there are no empty lines of text left in the sample data.
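These last cleaning steps can be sketched in base R as follows; the function name squeeze_whitespace is hypothetical.

```r
squeeze_whitespace <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace into one space
  x <- trimws(x)                     # strip leading and trailing whitespace
  x[nzchar(x)]                       # drop lines that ended up empty
}

tidy <- squeeze_whitespace(c("  what the  is that  thing ", "   ", "ok"))
print(tidy)
```

The second input line contains only spaces, so it disappears from the result.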

Exploratory Analysis of the Data.

Frequency Analysis of Words in the Data.

After the cleaning operations have been carried out, the sample text data is ready for exploration.

For starters, we can ask for the most frequent words in the data. The following table shows the most frequent words in our sample. More precisely, the table contains in decreasing order the frequencies of the words that appear more than 1500 times in the data.

##   the   and   for  that   you  with   was  this  have   are   but   not 
## 23171 11854  5360  5036  4658  3482  3165  2641  2640  2560  2397  2033 
##  from   its  will  they   all about  just   his  your 
##  1825  1792  1647  1623  1598  1522  1519  1516  1508
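A frequency table like the one above can be produced with table and sort. The following sketch uses three made-up lines and a threshold of 2 instead of 1500; it illustrates the idea and is not the report's code.

```r
# Made-up, already cleaned lines standing in for the sample data.
lines <- c("the cat and the dog", "the dog and the bird", "a bird")

# Split every line into words and count how often each word occurs.
words <- unlist(strsplit(lines, " +"))
freq  <- sort(table(words), decreasing = TRUE)

# Keep only the words that reach the chosen threshold.
frequent <- freq[freq >= 2]
print(frequent)
```

Here "the" comes first with 4 occurrences, and the words that appear only once are dropped.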

Graphically:

It is also interesting to take a look at the whole picture of the frequency distribution of the words in the data (which is, in a sense, the reverse of the previous picture). You can see that almost all of the words appear only a few times in the data. For graphical purposes we have limited this to words that appear at least $100$ times in the data, and you can clearly see in the picture that the frequency distribution is extremely skewed. The small bumps in the right tail correspond to the most frequent words that we have identified before, such as “the” or “and”:

It comes as no surprise that the most frequent words are the so-called stopwords, because these words serve as basic building blocks for English sentences. In many areas of NLP removing the stopwords is a necessary step of tokenization. However, for this particular application, I think that it is better to keep them. A useful text predicting model must be able to predict these stopwords, precisely because they represent such a big fraction of the user's text input.

Frequency analysis of n-grams in the sample data.

In the context of NLP, an n-gram (see Wikipedia) is a contiguous sequence of $n$ tokens; think $n$ consecutive words in a sentence. Many NLP models make extensive use of the analysis of the n-grams appearing in a corpus of text, and this becomes especially important in text predicting applications. Thus we turn now to the analysis of the n-grams in our sample data, for different values of $n$.

The following function can be used to extract the n-grams from a character vector.
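The function itself is not reproduced in this text. As a rough base R stand-in (the report uses the ngram package, and the name extract_ngrams is hypothetical), n-gram extraction could look like this:

```r
# Return all n-grams (as space-separated strings) found in each element of x.
extract_ngrams <- function(x, n) {
  one_line <- function(line) {
    words <- strsplit(trimws(line), "[[:space:]]+")[[1]]
    if (length(words) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }
  unlist(lapply(x, one_line))
}

trigrams <- extract_ngrams("some of last night and today was tense no idea why", 3)
print(trigrams)
```

A sentence of 11 words yields 11 - 3 + 1 = 9 trigrams. This sketch returns them in order of appearance, which may differ from the order produced by other tools.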

Let’s see how n-grams work. We take any of the sampled text lines in our data and extract, e.g., the 3-grams. This is the text:

## [1] "some of last night and today was tense no idea why"

and these are all the possible 3-grams in that sentence:

## [1] "tense no idea"   "last night and"  "night and today" "of last night"  
## [5] "was tense no"    "today was tense" "and today was"   "some of last"   
## [9] "no idea why"

To analyze the n-grams distribution in the English sentences in our data we apply this method to extract all the n-grams for each sentence in the sample data, for some values of $n$.

Let us begin with $n = 2$ (as $n = 1$ would bring us back to words). The 10 most frequent 2-grams in the sample data appear in the following table:

Similarly for 3-grams we get:

## grams_n_vec
##     one of the       a lot of thanks for the        to be a    going to be 
##            176            139            101            100             85 
##     be able to       it was a     as well as     out of the   cant wait to 
##             82             75             72             65             62

Finally for 4-grams:

## grams_n_vec
##      at the same time thanks for the follow       one of the most 
##                    38                    31                    29 
##      cant wait to see    for the first time        is going to be 
##                    28                    28                    27 
##         is one of the       the rest of the         to be able to 
##                    26                    26                    25 
##        the end of the 
##                    22

In all cases, the analysis of the n-gram frequencies indicates that it is necessary to go beyond a simple n-gram search, since the vast majority of the n-grams appear only once in the data, as illustrated in the following figure in the case of 4-grams. This emphasizes the importance of using an appropriate model for prediction.
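The share of n-grams that occur only once can be measured directly from a frequency table. The sketch below does this for 4-grams on three made-up lines; the data and variable names are illustrative assumptions, not results from the report.

```r
# Made-up, already cleaned lines standing in for the sample data.
lines <- c("at the same time",
           "at the same time again",
           "for the first time ever")

# Collect the 4-grams of every line.
four_grams <- unlist(lapply(strsplit(lines, " "), function(w) {
  if (length(w) < 4) return(character(0))
  vapply(seq_len(length(w) - 3),
         function(i) paste(w[i:(i + 3)], collapse = " "),
         character(1))
}))

# Fraction of distinct 4-grams that appear exactly once.
counts          <- table(four_grams)
singleton_share <- mean(counts == 1)
print(singleton_share)
```

In this toy example three of the four distinct 4-grams are singletons. A high share of this kind is what makes a plain lookup of previously seen n-grams insufficient.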

Technical side note: I am using the ngram library for this part of the analysis (see ngram). A more popular choice for this is the RWeka library (see RWeka). However, the R code supporting this analysis has been tested on Windows, Linux and Mac machines, and I have found many compatibility issues between RWeka and the Java versions on the test machines. I have managed to make RWeka work on Windows machines, but I’ll try to carry out the rest of the model construction without using this library, to increase the portability of the code.

Final Remarks.

The exploratory data analysis in this report is just the first step in the model building process. The next step is to obtain a simple n-gram model (see Wikipedia) from this data. Some other ideas for the rest of the project are the following:

The initial exploration of the data has not taken typos or foreign language words into account. The first problem may be approached using approximate string matching with agrep and an English dictionary of words. A dictionary-based approach may also be useful for foreign languages. It is possible, however, that the impact of both problems on the final model is small (this has still to be confirmed).
Additional sources of text can be easily incorporated into this framework, to see if this results in increased accuracy and coverage of the model.
An essential part of the remaining work is the study of the dependence between the accuracy of the model and the sample size.
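As an illustration of the first idea, base R's agrep can match a misspelled word against a dictionary by approximate matching. The three-word dictionary below is a made-up placeholder, and the allowed distance of one edit is an arbitrary choice for the sketch.

```r
# Placeholder dictionary; a real one would contain many thousands of words.
dictionary <- c("definitely", "define", "apple")

# Find dictionary entries within one edit of the misspelled word.
candidates <- agrep("definately", dictionary, max.distance = 1L, value = TRUE)
print(candidates)
```

Whether this kind of correction improves the final model is one of the points that still has to be checked.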

Milestone Report

Sarathy Jay

April 28, 2016