Data Science Capstone - Milestone Report

Synopsis

The goal of this report is to show that I have become familiar with the data and that I am on track to create a prediction algorithm from it. It is the milestone report of the Data Science Capstone course.

Major features of the data

The data comes from a corpus called HC Corpora, which was collected from publicly available sources by a web crawler. The crawler gathers texts in several languages, including English. Each entry is tagged with its type, based on the kind of website it was collected from (e.g. newspaper, tweets, or personal blog). For English there are the following three files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Below is a table with some statistics for these files:

File	Number of lines	Size in memory
en_US.blogs.txt	899,288	248.5 Mb
en_US.news.txt	1,010,242	249.6 Mb
en_US.twitter.txt	2,360,148	301.2 Mb
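
As a rough illustration of how statistics like these can be gathered, the sketch below counts the lines of one file and measures the size of its text. It is written in Python for illustration only (the packages named in this report are R packages), and the helper name file_stats is hypothetical. The size it reports is the UTF-8 byte count of the text, so it is not expected to match the in-memory sizes in the table.

```python
def file_stats(path, encoding="utf-8"):
    """Return (number of lines, size in MB of the UTF-8 encoded text) for one file.

    Hypothetical helper: the report does not say how its figures were computed.
    """
    n_lines = 0
    n_bytes = 0
    # Stream the file line by line so a large corpus never has to fit in memory.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            n_lines += 1
            n_bytes += len(line.encode("utf-8"))
    return n_lines, n_bytes / (1024 * 1024)
```

Calling file_stats("en_US.blogs.txt") would return the pair for the blogs file, assuming the file is in the working directory.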

Below are some interesting findings that I observed in the data:

Many words are not valid English spellings; some of them are written in another language.
There are a lot of colloquial words such as “haha”, “wow”, “yeah”, “yep”, “ohh”, and others.
Some of the colloquial words even appear in several spellings, for example “haha”, “hahah”, “jahaha”.
A lot of non-printable characters and emoticons are observed, especially in tweets.

Plan for creating, testing and using the prediction algorithm

Read file lines. Because the files contain a considerable number of lines (see the previous table), I will read only a randomly selected percentage of the lines for training.
Decompose the lines into tokens (tokenize). After some research I found the package “tokenizers”, which can be used for this purpose.
Filter the tokens. Clean data is a necessary but not sufficient condition for developing a prediction algorithm (G.R. Gendron, 2015). All words that are not suitable to be predicted have to be removed from the list of tokens, in particular those with English spelling errors. After some research I found the package “hunspell”, which can be used for this purpose. It is also necessary to remove capitalization in order to normalize the tokens for search purposes. Therefore, I will convert all tokens to lowercase.
Find a way to represent an n-gram. After some research I found the work of Michael Szczepaniak (2016), in which the words of an n-gram are concatenated with a separator character (for example, ’_’) and represented as a single string. For example, the two 3-grams of the text “This is a text” are “This_is_a” and “is_a_text”.
Obtain the n-grams from the tokens and build a corresponding data frame. In this step the n-grams are extracted from the list of tokens to build a training data frame for prediction. The data frame will have the following three variables:
- ngram: the words that appear before the outcome word.
- word: the outcome, that is, the word to be predicted from the previous words (the n-gram).
- freq: the frequency of the sequence (ngram -> word).
To avoid duplicated n-grams in the data frame, each new n-gram has to be searched for first; if it is already present, its “freq” is incremented instead of a new row being added. This is the reason for having the variable “freq”. Note that as n-grams are added, the data frame grows and the search becomes slower and slower. To handle this I will implement a mechanism that searches different sections of the data frame in parallel. I am thinking of keeping several fixed-size data frames and concatenating them at the end to get a single one. To do this I can apply a function such as “sapply” over the set of data frames that exist at that moment.
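
The first five steps of the plan can be sketched as follows. This is a minimal illustration in Python, not the implementation of this report: a regular expression stands in for the “tokenizers” package, an optional set of known words stands in for “hunspell”, and a hash-based counter replaces the search over fixed-size data frames described above. All function names are hypothetical.

```python
import random
import re
from collections import Counter

SEPARATOR = "_"  # joining character, as in the "This_is_a" example

def sample_lines(lines, fraction, seed=0):
    """Step 1: keep each line with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Steps 2-3: lowercase the line and extract word tokens (regex stand-in)."""
    return re.findall(r"[a-z']+", line.lower())

def filter_tokens(tokens, vocabulary):
    """Step 3: keep only tokens in a known-word set (stand-in for a spell checker)."""
    return [tok for tok in tokens if tok in vocabulary]

def ngram_rows(tokens, n):
    """Step 4: yield (ngram, word) pairs; ngram is n context words joined by SEPARATOR."""
    for i in range(len(tokens) - n):
        yield SEPARATOR.join(tokens[i:i + n]), tokens[i + n]

def build_table(lines, n, vocabulary=None):
    """Step 5: count every (ngram, word) pair; the count plays the role of freq."""
    table = Counter()
    for line in lines:
        tokens = tokenize(line)
        if vocabulary is not None:
            tokens = filter_tokens(tokens, vocabulary)
        table.update(ngram_rows(tokens, n))
    return table

table = build_table(["This is a text", "this is a test"], n=2)
for (ngram, word), freq in sorted(table.items()):
    print(ngram, word, freq)
```

Because the counter is keyed by the (ngram, word) pair, a repeated sequence increments its count directly and no explicit search is needed. One caveat of this sketch is that dropping a token joins its neighbours, so some n-grams may contain words that were not adjacent in the original line.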

In addition, in order to cover different kinds of sources, I will use a small percentage of the lines (5%, for example) of each file. The training data frame will therefore contain data from the blogs, the news, and the tweets.
Build the prediction model. The prediction model I am going to build predicts “word” (the outcome) from “ngram” (the n-gram) and its frequency (“freq”). I will build several models in order to compare both prediction quality and training time. I will use random naive Bayes, decision trees, generalized boosted regression, and random forest. My intention is to create models for three cases: 1-gram, 2-gram, and 3-gram.
Test the prediction model. To test the prediction models obtained in the previous step I will build a data frame with 0.05% of the lines of each of the three files. Depending on the results I will choose the best model, looking for the right balance between successful predictions and training time.
Use the prediction model. Provided the execution time is acceptable, to use the prediction model in a real scenario I am thinking of executing the three models previously trained and tested for 1 <= n <= 3. The three resulting words will be shown as suggestions in the order 3-gram, 2-gram, and 1-gram. If the same word is predicted by more than one model, it will be given greater importance and placed as the first option.
The Shiny app. My idea for the Shiny app is to have an edit control in which a user can write text. The text will be analyzed and processed dynamically to obtain the corresponding tokens. Depending on the number of tokens written, the prediction model will be executed for the 1-, 2-, and/or 3-gram case, and the suggested words will appear in a popup menu. The user will then be able to select a word from the list, which will be appended to the text already written, and continue writing.
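
The way the three predictions are combined into one list of suggestions can be sketched as below. This is an illustration in Python under stated assumptions: each model returns at most one word, the function name merge_suggestions is hypothetical, and a word predicted by several models is promoted by counting how many models proposed it.

```python
from collections import Counter

def merge_suggestions(by_order):
    """Merge one predicted word per model into a ranked suggestion list.

    by_order maps n (the number of context words: 3, 2 or 1) to the word
    predicted by that model, or to None when the model has no prediction.
    """
    # Candidates in the default display order: 3-gram, then 2-gram, then 1-gram.
    ordered = [by_order[n] for n in (3, 2, 1) if by_order.get(n) is not None]
    votes = Counter(ordered)
    # Drop repetitions while keeping the first position of each word.
    unique = []
    for word in ordered:
        if word not in unique:
            unique.append(word)
    # sorted() is stable, so words with equal votes keep the 3-2-1 order.
    return sorted(unique, key=lambda word: -votes[word])

print(merge_suggestions({3: "day", 2: "time", 1: "time"}))
```

In this example “time” is proposed by two models and is therefore listed before “day”, even though “day” comes from the 3-gram model.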

I also plan to include a simulation section for the prediction process, in which the user can write some words and run the model with a button to obtain and visualize the corresponding result.
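
The testing step of the plan compares predicted words against held-out text. The sketch below, in Python for illustration only, shows one way to measure that. It assumes a simple most-frequent-continuation lookup as the model, which is a stand-in and not one of the classifiers listed in the plan; the function names train and accuracy are hypothetical.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Map each context string (n-gram) to its most frequent following word."""
    counts = defaultdict(Counter)
    for context, word in pairs:
        counts[context][word] += 1
    return {context: words.most_common(1)[0][0] for context, words in counts.items()}

def accuracy(model, test_pairs):
    """Fraction of held-out (context, word) pairs whose word is predicted exactly."""
    test_pairs = list(test_pairs)
    if not test_pairs:
        return 0.0
    hits = sum(1 for context, word in test_pairs if model.get(context) == word)
    return hits / len(test_pairs)

model = train([("is_a", "text"), ("is_a", "test"), ("is_a", "test"), ("this_is", "a")])
held_out = [("is_a", "test"), ("this_is", "a"), ("is_a", "text"), ("was_a", "test")]
print(accuracy(model, held_out))
```

A context that never occurred in training, such as “was_a” here, counts as a miss. Training time could be measured alongside accuracy to support the trade-off described in the plan.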

Some statistics

The following table presents some metrics obtained by carrying out the first five steps of the previous section. Note that only 5% of the lines of each file were used.

File	Number of lines	Number of tokens	Size in memory (Mb)	Execution time (min)
en_US.blogs.txt	44,964	1,347,548	75.6	1.9
en_US.news.txt	50,512	1,351,047	76.2	2.1
en_US.twitter.txt	118,007	1,293,546	76.7	4.6

The word with the highest frequency (101,474 occurrences) is “the”, and there are 14,342 words with the lowest frequency, which is 1. Plot 1 shows the words that appear more than 5,000 times (frequency > 5,000) and Plot 2 shows the words with a frequency between 2,000 and 5,000. There are 51,009 words with a frequency below 2,000.

Plot 3 shows the word frequencies sorted in decreasing order. The x-axis is an index. Notice that most of the words have a low frequency.
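
Figures of this kind can be derived from a list of tokens as sketched below. This is an illustration in Python with a hypothetical function name, frequency_summary. It assumes that the band “between 2,000 and 5,000” includes both end points, which the text does not specify.

```python
from collections import Counter

def frequency_summary(tokens, high=5000, low=2000):
    """Summarize word frequencies: extremes, frequency bands, and sorted counts."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    freq = Counter(tokens)
    ranked = freq.most_common()  # (word, count) pairs in decreasing order of count
    top_word, top_count = ranked[0]
    min_count = ranked[-1][1]
    counts = list(freq.values())
    return {
        "top_word": top_word,
        "top_count": top_count,
        "min_count": min_count,
        "n_min": sum(1 for c in counts if c == min_count),
        "n_above_high": sum(1 for c in counts if c > high),
        # Assumption: both end points belong to the middle band.
        "n_between": sum(1 for c in counts if low <= c <= high),
        "n_below_low": sum(1 for c in counts if c < low),
        # Series for a rank-versus-frequency plot (x-axis is the index).
        "sorted_counts": [count for _, count in ranked],
    }

tokens = ["the"] * 5 + ["a"] * 3 + ["cat", "dog"]
print(frequency_summary(tokens, high=4, low=2))
```

The small thresholds in the usage line only make the toy example readable; the default values correspond to the bands used in this section.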

Summary

This report presents my experience working with the data from the HC Corpora corpus. The data will be used to train a prediction model for natural-language processing. A detailed plan for creating, testing, and using the prediction algorithm is also presented. It is important for a data scientist to make this type of plan when addressing a complex problem such as this one.

References

Gendron, Gerald R. (2015). Natural Language Processing: A Model to Predict a Sequence of Words. MODSIM World 2015. Available at http://www.modsimworld.org/papers/2015/Natural_Language_Processing.pdf
Szczepaniak, Michael. (2016). Understanding the Katz Back-Off Model. Available at https://rpubs.com/mszczepaniak/predictkbo3model