Jose Antonio Garcia Ramirez
December 23, 2017

Coursera Data Science Capstone Project (NLP)

  • Get data
  • Data wrangling
  • Data processing
  • Model and demo product

Get data

The raw data consist of three large English text files: blog posts, news articles, and tweets.

After unzipping, the files have the following characteristics:

File      Lines       Max line length (characters)
blogs     899,288     40,833
news      77,259      5,760
twitter   2,360,148   140
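The counts above can be reproduced with a short script; the sketch below is a minimal Python version, and the file names (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are an assumption about how the unzipped files are named.

    # Sketch (not the project's original code): count the lines and the longest
    # line of each unzipped file. The file names are assumptions.
    files = {
        "blogs": "en_US.blogs.txt",
        "news": "en_US.news.txt",
        "twitter": "en_US.twitter.txt",
    }

    for name, path in files.items():
        n_lines, max_len = 0, 0
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                n_lines += 1
                max_len = max(max_len, len(line.rstrip("\n")))
        print(f"{name}: {n_lines:,} lines, longest line: {max_len:,} characters")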

Data wrangling

Following the work of [1], we apply the following transformations to clean the data (a sketch follows the reference below):

  • Remove extra whitespace, punctuation, and numbers.
  • Convert all text to lowercase.
  • Remove stop words such as "a", "an", and "by".
  • Stem the remaining words.

[1]: Pengda Qin, Weiran Xu and Jun Guo, 'A Targeted Retraining Scheme of Unsupervised Word Embeddings for Specific Supervised Tasks,' in Advances in Knowledge Discovery and Data Mining 2017, Springer.
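The sketch below illustrates the four cleaning steps in Python. It is not the project's original code; the use of NLTK's English stop word list and Porter stemmer is an assumption made only for illustration.

    # Sketch of the four cleaning steps using NLTK (assumed library choice).
    import re
    from nltk.corpus import stopwords        # requires nltk.download("stopwords")
    from nltk.stem import PorterStemmer

    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def clean(line: str) -> list[str]:
        line = line.lower()                            # transform to lowercase
        line = re.sub(r"[^a-z\s]", " ", line)          # drop punctuation and numbers
        tokens = line.split()                          # also removes extra whitespace
        tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
        return [STEMMER.stem(t) for t in tokens]       # stem the remaining words

    print(clean("The 2 cats are sitting, by the window!"))
    # ['cat', 'sit', 'window']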

Data processing

Due to hardware limitations, we split the blogs and news files into 8 parts each, processed each part separately (extracting the n-grams), and finally merged the results of all parts; a sketch follows the figure below.

(Figure: ETL pipeline)
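The sketch below shows one way to implement the chunked extraction and the final join. It is an illustration, not the project's original code, and the part file names (blogs_part1.txt ... blogs_part8.txt) are hypothetical.

    # Sketch of the chunked n-gram extraction over hypothetical part files.
    from collections import Counter
    from itertools import islice

    def ngrams(tokens, n):
        """Yield the n-grams of a token list as tuples."""
        return zip(*(islice(tokens, i, None) for i in range(n)))

    def count_ngrams(path, n):
        """Count the n-grams of one part of the corpus."""
        counts = Counter()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                tokens = line.lower().split()   # cleaning as in the previous section
                counts.update(ngrams(tokens, n))
        return counts

    # Process each part on its own, then join (sum) the partial counts.
    total = Counter()
    for part in (f"blogs_part{i}.txt" for i in range(1, 9)):
        total += count_ngrams(part, n=3)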

Model and demo product

  • Finally, using the n-grams, we can predict the next word of a sentence; the implementation accepts anything from a single word to a line of arbitrary length as input (see the sketch after this list).

  • A version of the natural language processing model (trained on a subset of the data, and therefore less powerful than the full model) is available in this app.
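As an illustration of the prediction step, the sketch below looks up the longest matching context in a table of n-gram counts and backs off to shorter contexts when the full context is unseen. It is a simplified stand-in for the actual model, and the counts shown are made up.

    # Sketch of next-word prediction by n-gram lookup with simple backoff.
    from collections import Counter

    ngram_counts = {
        ("i", "love"): Counter({"you": 12, "it": 7}),
        ("love",): Counter({"you": 30, "the": 9}),
    }

    def predict_next(text: str, counts: dict, max_context: int = 2) -> str:
        tokens = text.lower().split()
        # Back off from the longest available context to a single word.
        for size in range(min(max_context, len(tokens)), 0, -1):
            context = tuple(tokens[-size:])
            if context in counts:
                return counts[context].most_common(1)[0][0]
        return "the"   # fallback when no context is found

    print(predict_next("I love", ngram_counts))       # 'you'
    print(predict_next("people love", ngram_counts))  # backs off to ('love',) -> 'you'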