Jose Antonio Garcia Ramirez
December 23, 2017

Coursera Data Science Capstone Project (NLP)

Get data
Data wrangling
Data processing
Model and demo product

Get data

The data consist of three large text files (blogs, news and twitter), available here.

After unzipping, the files have the following sizes:

File	Lines	Max length of line (characters)
blogs	899,288	40,833
news	77,259	5,760
twitter	2,360,148	140
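
Figures like those in the table can be obtained by streaming each file once. The following is a minimal sketch, not the code used for this project; the function name `file_stats` and the UTF-8 encoding are assumptions.

```python
def file_stats(path, encoding="utf-8"):
    """Return (number_of_lines, max_line_length_in_characters) for a text file."""
    n_lines = 0
    max_len = 0
    # Stream line by line so that the whole file never has to fit in memory.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            n_lines += 1
            # Do not count the trailing newline as part of the line length.
            max_len = max(max_len, len(line.rstrip("\r\n")))
    return n_lines, max_len
```

Calling `file_stats` on a hypothetical path such as `"blogs.txt"` would return one row of the table as a pair.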

Data wrangling and processing

Following [1], we apply these transformations to clean the data:

Remove extra whitespace, punctuation and numbers.
Convert to lowercase.
Remove stop words like a, an, by …
Stem words.
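
The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the stop-word list is a tiny hypothetical sample, `toy_stem` is a crude stand-in for a real stemmer (for example Porter's), and the regular expression keeps only ASCII letters, so accented characters are dropped too. Lowercasing is done first here only to keep the pattern simple.

```python
import re

# Illustrative stop-word sample only (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "by", "the", "of", "and", "to", "in"}

def toy_stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_line(line):
    """Apply the four cleaning steps and return the list of remaining tokens."""
    text = line.lower()                      # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # replace punctuation, digits, symbols
    tokens = text.split()                    # splitting also collapses whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]

print(clean_line("The 3 cats were   running, by the river!"))
# prints ['cat', 'were', 'runn', 'river']
```

The output shows why a toy stemmer is only a placeholder: "running" becomes "runn", whereas a proper stemmer would give "run".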

[1]: Pengda Qin, Weiran Xu and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks,' in Advances in Knowledge Discovery and Data Mining 2017, Springer.

Due to hardware limitations, we divided the blogs and news files into 8 parts each, extracted the n-grams from each part separately, and joined the results of all parts at the end.
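
The split, process and join scheme can be sketched as below. This is a hedged illustration: the function names are hypothetical, and the lines are held in a list for brevity, whereas a memory-constrained run would read each part from disk in turn.

```python
from collections import Counter

def split_into_parts(lines, n_parts=8):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(lines), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts

def count_ngrams(lines, n):
    """Count n-grams (tuples of n consecutive tokens) within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def count_ngrams_in_parts(lines, n, n_parts=8):
    """Count n-grams part by part, then join the partial results."""
    total = Counter()
    for part in split_into_parts(lines, n_parts):
        total.update(count_ngrams(part, n))  # Counter.update adds the counts
    return total
```

Because n-grams are counted within lines and the parts are split on line boundaries, the joined counts equal the counts obtained from the whole file at once.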

[Figure: ETL diagram]

Model

Finally, using n-grams, we can predict the next word of a sentence (the implementation accepts input ranging from a single word to a line of any length).
The prediction is the maximum likelihood estimate, i.e., the word that most frequently follows the given words in our corpus.
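
A minimal sketch of maximum likelihood next-word prediction from n-gram counts follows. The names `train` and `predict_next`, the limit `max_n=3`, and the fallback to a shorter context when the longer one is unseen are all assumptions made for illustration; the source does not describe how the original model handles unseen contexts.

```python
from collections import Counter, defaultdict

def train(lines, max_n=3):
    """Map each context (1 to max_n - 1 tokens) to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase, max_n=3):
    """Return the most frequent continuation, trying the longest context first."""
    tokens = phrase.split()
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in model:
            # Highest count = maximum likelihood estimate for this context.
            return model[context].most_common(1)[0][0]
    return None  # no known context (assumed behaviour)

corpus = ["i like green tea", "i like green tea",
          "i like black coffee", "we drink green juice"]
model = train(corpus)
print(predict_next(model, "i like"))            # prints green
print(predict_next(model, "they drink green"))  # prints juice
```

Only the last words of the phrase are used as context, which is why the input can be a single word or a line of any length.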

Demo product

A reduced version of the natural language processing model (less powerful than the original built with all the data) is available in this app.
In the first tab (Guess word):
1. Enter a phrase in the text box.
2. The entry is confirmed and the typed text is displayed.
3. At the bottom, the best prediction for the entered phrase is displayed.
In the second tab (Another guesses):
- If you are not satisfied with the best prediction from the first tab, the second tab shows a table with other possible predictions (the 'frequency' field indicates the order of the options).
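
A table of alternative guesses like the one in the second tab could be produced by ranking candidate words by their counts. This is a sketch only; the function name `ranked_guesses`, the row layout and the default of five rows are assumptions, not a description of the app's code.

```python
from collections import Counter

def ranked_guesses(next_word_counts, k=5):
    """Return up to k (rank, word, frequency) rows, most frequent first."""
    rows = []
    for rank, (word, freq) in enumerate(next_word_counts.most_common(k), start=1):
        rows.append((rank, word, freq))
    return rows

# Hypothetical counts of words observed after some phrase.
counts = Counter({"tea": 5, "juice": 2, "paint": 1})
for row in ranked_guesses(counts, k=2):
    print(row)
# prints (1, 'tea', 5) then (2, 'juice', 2)
```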