Jose Antonio Garcia Ramirez
December 23, 2017

Coursera Data Science Capstone Project (NLP)

  • Get data
  • Data wrangling
  • Data processing
  • Model and demo product

Get data

The data consists of three large text files (blogs, news and twitter), available here.

After unzipping, the files have the following sizes:

File      Lines       Max line length (characters)
blogs     899,288     40,833
news      77,259      5,760
twitter   2,360,148   140
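The two statistics in the table can be computed in a single streaming pass, without loading a whole file into memory. The following is a minimal sketch, not the code used for the table; the function name `line_stats` is hypothetical.

```python
def line_stats(lines):
    """Return (number of lines, length in characters of the longest line).

    `lines` is any iterable of strings, for example an open text file.
    Trailing newline characters are not counted in the length.
    """
    n_lines = 0
    max_len = 0
    for line in lines:
        n_lines += 1
        max_len = max(max_len, len(line.rstrip("\r\n")))
    return n_lines, max_len


# Small in-memory example; a real run would pass an open file handle instead.
sample = ["first line\n", "a much longer second line\n", "short\n"]
print(line_stats(sample))
```

Because an open file is itself an iterable of lines, the same function can be applied to each of the three files by passing a handle opened with an explicit encoding.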

Data wrangling and processing

Following the work of [1] we perform the following transformations to clean the data:

  • Remove extra whitespace, punctuation and numbers
  • Convert to lowercase
  • Remove stop words such as a, an, by
  • Stem words
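The cleaning steps listed above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the project's code: the stop-word set is a tiny illustrative subset, and `crude_stem` is a hypothetical stand-in that only strips a few suffixes (a real run would use a proper stemmer such as Porter's).

```python
import re

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "by", "the", "of", "to", "and"}


def crude_stem(word):
    """Strip a few common English suffixes (placeholder for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Apply the four cleaning steps and return a list of tokens."""
    # 1. Replace punctuation, digits and any other non-letter with a space.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2. Convert to lowercase.
    text = text.lower()
    # str.split() with no argument also collapses repeated whitespace.
    tokens = text.split()
    # 3. Drop stop words, 4. stem what remains.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean("The 3 dogs walked by a park!"))
```

Replacing removed characters with a space rather than deleting them keeps words such as "well-known" from being fused into a single token.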

[1]: Pengda Qin, Weiran Xu and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks,' in Advances in Knowledge Discovery and Data Mining 2017, Springer.

Due to the limitations of the computer equipment, we divided the blogs and news files into 8 parts each, processed each part separately (extraction of n-grams), and finally joined the results of all parts.
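A minimal sketch of this split, process and join scheme follows. It assumes the files are split by whole lines and that n-grams never span two lines, so counting per part and summing gives the same result as counting everything at once. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_into_parts(lines, parts=8):
    """Split a list of lines into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def count_ngrams(lines, n):
    """Count the n-grams of every line in one chunk."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts


def count_in_parts(lines, n, parts=8):
    """Process each part separately, then join the partial counts."""
    total = Counter()
    for chunk in split_into_parts(lines, parts):
        total.update(count_ngrams(chunk, n))  # Counter.update adds counts
    return total


lines = ["a b c", "a b d", "b c"]
print(count_in_parts(lines, 2))
```

In practice each partial count could be written to disk before the join, so that only one part needs to be in memory at a time.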

[Figure: ETL]

Model

  • Finally, using n-grams, we predict the next word of a sentence (the implementation accepts input ranging from a single word to a line of any length).
  • The prediction is the maximum likelihood estimate, i.e. the word that most frequently follows the given words in our corpus.
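The idea can be sketched as follows. For a context c, the maximum likelihood estimate of P(w | c) is count(c, w) / count(c), so the most likely next word is simply the one with the highest count after c. The sketch below is an assumption-laden illustration, not the project's implementation: the function names are hypothetical, and falling back to a shorter context when the longer one was never seen is a design choice added here, not something stated above.

```python
from collections import Counter, defaultdict


def build_model(sentences, n=3):
    """Map each context of 1 to n-1 words to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            for k in range(1, n):
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][tokens[i]] += 1
    return model


def predict_next(model, phrase, n=3):
    """Return the most frequent follower of the last words of `phrase`.

    Tries the longest available context first and falls back to shorter
    ones (assumption); returns None if no context was seen in training.
    """
    tokens = phrase.lower().split()
    for k in range(min(n - 1, len(tokens)), 0, -1):
        followers = model.get(tuple(tokens[-k:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None


corpus = ["i love data science", "i love data", "i love cats", "you love cats"]
model = build_model(corpus)
print(predict_next(model, "i love"))
```

Only the last n-1 words of the input are used, which is why the input may be a single word or a line of any length.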

Demo product

  • A version of the natural language processing model (less powerful than the original, which uses all the data) is available in this app
  • In the first tab (Guess word):
    1. Enter the phrase in the text box.
    2. The entry is confirmed and the typed text is displayed.
    3. At the bottom, the best prediction for the entered phrase is displayed.
  • In the second tab (Another guesses):
    • If you are not satisfied with 'the best prediction' of the first tab, the second tab shows a table with other candidate predictions (the field 'frequency' indicates the order of the options)
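A table like the one in the second tab can be produced by ranking candidate words by their frequency. The following is a minimal sketch with made-up counts; the function name `top_candidates` and the example numbers are hypothetical and do not come from the app.

```python
from collections import Counter


def top_candidates(follower_counts, k=5):
    """Return up to k (word, frequency) pairs, most frequent first."""
    return Counter(follower_counts).most_common(k)


# Hypothetical counts of words seen after some phrase.
example_counts = {"data": 2, "cats": 1, "dogs": 3}
for rank, (word, frequency) in enumerate(top_candidates(example_counts, k=2), 1):
    print(rank, word, frequency)
```

The first row of such a table corresponds to the best prediction shown in the first tab.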