1. Summary

This report is the second part of the Natural Language Processing project, the final Capstone of the Coursera Data Science Specialization by Johns Hopkins University. The goal of the project is to build a text prediction app that suggests the next word based on the words the user types. In the previous report, I summarized initial statistical findings from the data. This report introduces my approach to building the prediction model and valuable findings on prediction optimization.

Before reading this report, you are encouraged to have a look at the previous one at: http://rpubs.com/nhohung/NLP_processing. The final app is published online at: https://nhohung.shinyapps.io/TextPrediction/. A short presentation of this project is posted on: http://rpubs.com/nhohung/NLP_summary.

2. Data processing and initial summary

The data contains 3 English text files from blogs, news and twitter, as summarized below:

File source   Size (in MB)   Line count    Word count
blog          200.42           899,288     37,334,641
news          196.28         1,010,242     34,372,792
twitter       159.36         2,360,148     30,373,906

I cleaned the data and calculated different n-grams after merging all 3 files into a single one (this big file has 4,269,678 lines and 102,081,339 words). The process includes 3 steps: corpus creation, tokenization (with cleaning integrated), and dfm calculation (for technical implementation details, please see the previous report).

An example of how the highest-frequency trigrams (phrases of 3 consecutive words) can be extracted from the dfm is sketched below:
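The sketch below is minimal and assumes the merged, cleaned text is held in a character vector named all_lines (an illustrative name); it follows the corpus → tokens → dfm pipeline from the previous report:

    library(quanteda)

    # all_lines: character vector holding the merged blog/news/twitter text (illustrative name)
    corp <- corpus(all_lines)
    toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
    toks <- tokens_tolower(toks)

    tri     <- tokens_ngrams(toks, n = 3, concatenator = " ")  # phrases of 3 consecutive words
    tri_dfm <- dfm(tri)

    topfeatures(tri_dfm, 10)   # the ten most frequent trigrams with their counts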

I found that even on a Windows computer with 16 GB of RAM, I still had to set a 32 GB page file on the SSD and constantly remove objects from the workspace, so data of this size is very expensive to process. In my experience, the most time-consuming part of this routine is the tokenization (with cleaning integrated), while the part that eats up memory is the dfm calculation (it produced many insufficient-memory errors).

3. Prediction model strategy

3.1. Prediction model determination

I will build a prediction model that offers suggestions as the user types in text. Broken down, this is how it works:

  • The user types in n word(s) as input

  • If n = 1 (1 input word), the model suggests a second word from the 2-gram that starts with that input word and has the highest frequency

  • If n = 2 (2 input words), the model suggests a third word from the 3-gram that starts with those 2 input words and has the highest frequency

  • If n = 3 (3 input words), the model suggests a fourth word from the 4-gram that starts with those 3 input words and has the highest frequency

  • (and so on)

  • In any of the cases above, say the input has n-1 words: if the model cannot find a suggestion (the n-gram was not seen), it takes the last n-2 words of the input and looks for a suggestion in the (n-1)-grams. If the (n-1)-grams also yield nothing, the (n-2)-grams are searched, and so on down to the 2-grams. A sketch of this back-off cascade is shown right after this list.
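The following is a minimal sketch of this back-off cascade, not the app's exact code. It assumes three pruned n-gram tables named quad_dt, tri_dt and bi_dt, plus a helper match_prefix() that returns the candidate next words for a given prefix; these names are illustrative, and one possible implementation of match_prefix() follows the accuracy table in section 3.3.

    # Sketch of the back-off cascade described above (illustrative names).
    # quad_dt, tri_dt, bi_dt: pruned 4-, 3- and 2-gram tables;
    # match_prefix(prefix, table): returns candidate next words for the prefix.
    predict_next <- function(words) {
      words <- tail(words, 3)                        # use at most the last 3 input words
      while (length(words) > 0) {
        tbl  <- switch(as.character(length(words)),
                       "3" = quad_dt, "2" = tri_dt, "1" = bi_dt)
        hits <- match_prefix(paste(words, collapse = " "), tbl)
        if (length(hits) > 0) return(head(hits, 5))  # suggestions found at this order
        words <- words[-1]                           # back off: drop the first word
      }
      NA_character_                                  # no match at any order
    }

    predict_next(c("thanks", "for", "the"))          # example call with a 3-word input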

3.2. Model standards

In this project, my standards for the app are:

  • Give 5 suggestions instead of 1

  • Response time must be negligible (less than 1 second), for both app initialization and word prediction

  • When there is no suitable suggestion, return NA

Below are 2 points I want to clarify about this model:

  • 5 suggestions: in practice, giving 1 prediction almost never matches what the user wants. We think about different things when typing, and an n-gram-based prediction cannot be expected to guess exactly what we have in mind. 3 or 5 suggestions are common options in real apps (Google Keyboard, SwiftKey). Giving 5 suggestions increases the probability of matching the user's intention.

  • Returning NA when there is no match: the last standard reflects the idea of the Back-off model, as I am not building a smoothing function for this project. Dealing with unseen n-grams requires techniques that are beyond the scope of this Specialization. Moreover, it is reasonable to say there is no prediction, given the variety of what users may want; I would rather suggest nothing than give a nonsensical prediction. In this project, the Back-off model already improves the unseen-input case by using lower-order n-gram predictions: if the n-grams have no match, the (n-1)-grams and lower orders are searched for suggestions.

You will see in the real app how effective these two choices are.

3.3. Technical details

I have successfully built the prediction model. However, the first version was heavy and ran very slowly: it was almost 300 MB and matching each user input took about 30 seconds! I researched different sources, intensively tweaked the model, and finally reached a model size of 4 MB with almost instant matching. Below are the most important details I found while making it work efficiently:

  • Using a 4-gram model gives the best experience. With 10% of the data used for validation, the accuracies of the 2-gram, 3-gram and 4-gram models are about 16%, 23% and 27%, respectively. This suggests that 3-gram prediction may be theoretically adequate, but my practical tests show better results with 4-gram prediction. Although building the 4-gram model takes more resources, for this project I use back-off 4-gram prediction.

  • The n-gram model can be heavily truncated by removing low-frequency tokens. Users have a broadly similar way of speaking/writing, and tokens that occur only a few times are rare in regular text input. Besides, the app gives suggestions rather than writing the whole content for the user, so it is reasonable to remove them. In this project, I leave out all tokens that appear fewer than 10 times.

  • Among similar tokens (n-gram tokens sharing the same n-1 starting words), I keep only the 5 with the highest frequency counts. Because the app gives only 5 suggestions for a given set of n-1 starting words, keeping lower-frequency tokens that would never be suggested is a waste.

  • Token counts/frequencies are not stored, which reduces the size of the model. Because this project uses a Back-off model without interpolating between actual token counts, there is no need to keep this information.

  • To reduce the response time, there are 2 points:

    • First, use the data table format. After loading the model, the R session needs a certain amount of time to read the data before it is ready to work on it. Reading a data table is faster than reading a data frame (reading time, under this definition, is the time until the data is ready to be worked on, not the loading time).

    • Second, use a proper matching function if the model is long (millions of lines). Regular searching functions like grep (grepl, gregexpr, …) are very slow. Aggregating with dplyr does not help very much. Hash indexing or a data table index is fast but takes a long time up front to build the hash map or index. In this app I use the function startsWith combined with which to search for text; this runs amazingly fast and provides a very responsive user experience (a sketch of this lookup follows the accuracy table below).

  • For this project, I divide the data into 2 parts: the first is used to build the model and the second to test the prediction. The evaluation first takes n random consecutive words from each line of the test set as the input and the following word as the true value. Then the (n+1)-gram model is used to predict the next word given the n-word input; if the correct word is among the 5 predicted words, the prediction is counted as correct (a sketch of this evaluation loop is given in section 4.3). The table below summarizes the prediction accuracy (in %) of the 2-gram, 3-gram and 4-gram models for different sizes of test data:

n-grams    0.01%    0.1%     1%      10%
2-grams    17.56    14.82    15.85   15.70
3-grams    23.65    22.53    22.70   22.63
4-grams    23.88    27.80    26.89   26.98
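As a rough sketch of the startsWith/which lookup mentioned above, assuming each n-gram table is a data.table with a single token column that stores the full n-gram as a space-separated string (the object and column names are illustrative):

    library(data.table)

    # ngram_dt: data.table with a column `token`, e.g. "thanks for the follow" (illustrative)
    match_prefix <- function(prefix, ngram_dt) {
      idx  <- which(startsWith(ngram_dt$token, paste0(prefix, " ")))  # trailing space avoids partial-word matches
      hits <- ngram_dt$token[idx]
      vapply(strsplit(hits, " "), function(w) w[length(w)], character(1))  # keep only the predicted last word
    }

    match_prefix("thanks for the", quad_dt)  # returns at most 5 candidates, since the table is pre-pruned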

4. Technical implementation

4.1. Data processing (continued, inherited from part 1)

Each data file is shuffled (using data_shuffled <- sample(data, length(data)), where data holds the lines of each of the 3 files). I then take 90% of the lines of each file for training and the remaining 10% for testing, and merge the corresponding parts to build the train and test datasets.
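A minimal sketch of this shuffle-and-split step for one file (the 90% proportion is from the text above; the set.seed call and object names are illustrative):

    set.seed(1234)                               # only to make the split reproducible
    data_shuffled <- sample(data, length(data))  # shuffle the lines, as described above
    n_train <- floor(0.9 * length(data_shuffled))
    train <- data_shuffled[seq_len(n_train)]     # 90% of the lines go to training
    test  <- data_shuffled[-seq_len(n_train)]    # remaining 10% go to testing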

With the training dataset:

  • Create a single training corpus, tokenize and create dfm (unchanged from the previous report)

  • Remove tokens with frequencies less than 10 using dfm_trim, then sort the top features using topfeatures (call the result topfeature10)

  • Convert the data to data.table format and keep only the 5 highest-frequency n-gram tokens among those starting with the same n-1 words, using topfeature10[, .SD[1:5], token_starts]. This introduces NA rows for starting-word groups that have fewer than 5 tokens; they are then removed with na.omit(topfeature10).

  • Remove profanity and some leftover non-English characters (like these: â, ǽ) from the text. For best performance, use startsWith, endsWith and which to locate them.

It must be noted that building the dfm of the 4-grams crashed my computer (Windows 10, 16 GB of RAM, 32 GB page file on SSD). Therefore I (1) divide the merged file into 5 smaller parts, (2) compute the corpus, tokenization and dfm for each of them, and (3) merge the dfms together before dfm_trim. The later steps are applied as described above.
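Putting the merging, trimming and pruning steps together for the 4-grams, a hedged sketch could look like the following (the five chunk dfms are assumed to be named quad_dfm1 … quad_dfm5; all object names are illustrative):

    library(quanteda)
    library(data.table)

    # Merge the five chunk dfms, then trim rare tokens (threshold of 10 from above)
    quad_dfm <- rbind(quad_dfm1, quad_dfm2, quad_dfm3, quad_dfm4, quad_dfm5)
    quad_dfm <- dfm_trim(quad_dfm, min_termfreq = 10)
    freqs    <- topfeatures(quad_dfm, n = nfeat(quad_dfm))   # named vector, sorted by frequency

    quad_dt <- data.table(token = names(freqs), count = freqs)
    quad_dt[, prefix := sub(" \\S+$", "", token)]            # prefix = first n-1 words
    quad_dt <- quad_dt[, .SD[1:5], by = prefix]              # keep at most 5 tokens per prefix
    quad_dt <- na.omit(quad_dt)                              # drop NA padding for prefixes with < 5 tokens
    quad_dt[, c("count", "prefix") := NULL]                  # counts are not needed in a pure back-off model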

4.2. Model building

The prediction Back-off model can then be built as outlined in section 3.1. That is the core part of the whole project and I leave the implementation to the reader (I believe you can make it work; to deploy it to shinyapps efficiently, see section 3.3).

4.3. Model validation

With the test dataset, only simpler processing is required (removing profanity and leftover non-English characters); it can then be used for evaluation as depicted in the last point of section 3.3.
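A minimal sketch of this evaluation loop for the 4-gram model, assuming test holds the held-out lines as plain text and predict_next() is the back-off function sketched in section 3.1 (the names are illustrative):

    # test: character vector of held-out lines; predict_next(): back-off predictor (section 3.1 sketch)
    evaluate <- function(test, n = 3) {
      correct <- 0L; total <- 0L
      for (line in test) {
        words <- strsplit(tolower(line), "\\s+")[[1]]
        words <- words[nzchar(words)]
        if (length(words) < n + 1) next                    # too short to take n words plus the true next word
        start <- sample(seq_len(length(words) - n), 1)     # random position of n consecutive words
        input <- words[start:(start + n - 1)]
        truth <- words[start + n]
        total <- total + 1L
        if (truth %in% predict_next(input)) correct <- correct + 1L  # correct if truth is among the 5 suggestions
      }
      correct / total                                      # top-5 accuracy
    }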

5. Results

In the end, I obtain a 3.77 MB model file containing, for each set of starting words, the 5 highest-frequency bigrams, trigrams and quadrigrams that appear at least 10 times in the training set.