Capstone Project: Natural Language Processing

Asier Goikoetxea
2016/10/10

1- Introduction

The goal of this project is to explore different NLP models and to build a practical application that predicts the next word of a sentence. The project is divided into several phases:

  • Understanding the problem and the Data
    • Exploration
  • Model Selection
  • Developing a Shiny App
  • Evaluation

2- Exploration

  • Three files with RAW data:
    • Twitter, news articles and blogs.
  • Sample the data
  • Clean the Data:
    • Convert the text to lowercase; remove punctuation, stopwords and bad words (profanity).
  • Tokenize: create 1-gram, 2-gram, 3-gram and 4-gram tables.

Summary of the three raw files:

     file      size   lines    words
1 twitter 316037600 2360148 30093410
2   blogs 260564320  899288 37546246
3    news 261759048 1010242 34762395
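
The sampling, cleaning and tokenizing steps listed above can be sketched as follows. This is a minimal Python illustration, not the project's actual code: the function names, the tiny `STOPWORDS` and `PROFANITY` sets, and the two example sentences are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical, tiny stand-ins for real stopword and profanity lists.
STOPWORDS = {"the", "a", "of"}
PROFANITY = {"darn"}

def sample_lines(lines, fraction, seed=42):
    """Keep roughly `fraction` of the lines, reproducibly."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(line):
    """Lowercase, replace non-letters with spaces, drop stopwords and profanity."""
    text = re.sub(r"[^a-z\s]", " ", line.lower())
    return [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]

def ngram_table(token_lists, n):
    """Count every run of n consecutive tokens within each line."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up example lines standing in for the sampled corpus.
lines = ["The cat sat on the mat.", "The cat sat on a darn chair!"]
tokens = [clean(line) for line in lines]
tables = {n: ngram_table(tokens, n) for n in (1, 2, 3, 4)}
```

Each entry of `tables` maps an n-gram (a tuple of words) to its count. Note that this simple regex also splits contractions such as "don't" into two tokens, which a real pipeline may want to handle differently.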

3- Model Selection

Different options for Back-Off Models:

  • Katz Back-off: Good-Turing discounting
  • Kneser-Ney discounting
    • Most complicated logic
    • Best accuracy
  • Stupid Back-Off (fixed back-off factor of 0.4, no discounting)
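
To make the simplest of these options concrete, here is a minimal Python sketch of Stupid Back-Off scoring with the fixed 0.4 factor. It is an illustration under assumptions, not the app's implementation: `build_counts`, `stupid_backoff` and the toy sentence are hypothetical names and data.

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off factor

def build_counts(tokens, max_n=4):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_words):
    """Score `word` after `context` (a tuple of preceding words).

    The result is a relative-frequency score, not a normalized probability.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # The n-gram was seen: use its relative frequency.
            return penalty * counts[ngram] / counts[context]
        # Not seen: drop the oldest context word and apply the penalty.
        context = context[1:]
        penalty *= ALPHA
    # No context left: fall back to the unigram frequency.
    return penalty * counts[(word,)] / total_words

# Toy corpus, made up for the example.
tokens = "i like green eggs and i like ham".split()
counts = build_counts(tokens)
seen = stupid_backoff("ham", ("i", "like"), counts, len(tokens))
backed_off = stupid_backoff("ham", ("green",), counts, len(tokens))
```

In this toy corpus "i like ham" was observed, so `seen` is the plain relative frequency, while "green ham" was not, so `backed_off` falls back to the unigram frequency of "ham" multiplied by 0.4.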

4- Final App

The app consists of two columns: the left column takes the user input and the right one shows the output. My Shiny App Link

  • Inputs: a text field to type the text and a button to execute the model and guess the next word.
  • Outputs:
    • The single word with the highest probability
    • The top 5 words with the highest probability
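
Both outputs can be derived from one ranked list of candidate scores. The Python sketch below assumes the model has already produced a score per candidate word; the `scores` values are made up for illustration and `top_predictions` is a hypothetical helper.

```python
import heapq

def top_predictions(scores, k=5):
    """Return the k highest-scoring words, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

# Made-up candidate scores for some typed text.
scores = {"you": 0.31, "the": 0.22, "a": 0.12, "it": 0.09, "me": 0.07, "this": 0.02}
top5 = top_predictions(scores)
best = top5[0]
```

Here `best` is the single most likely word and `top5` holds the five highest-scoring candidates in descending order.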

5- Evaluation

Limitations:

  • App Accuracy:
    • The Stupid Back-Off model needs a very large corpus to reach competitive accuracy, and the demo could not use a corpus of that size.
    • Unknown word handling could be improved
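
One common way to improve unknown word handling, shown here only as a possible direction and not as something the app does, is to map rare training words and unseen input words to a shared placeholder token. The Python sketch below assumes a hypothetical `<unk>` token and a `min_count` threshold of 2.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token

def build_vocab(tokens, min_count=2):
    """Keep only words seen at least `min_count` times."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def map_unknown(tokens, vocab):
    """Replace every out-of-vocabulary word with the placeholder."""
    return [w if w in vocab else UNK for w in tokens]

# Toy training text, made up for the example.
train = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train)
mapped = map_unknown("the dog sat".split(), vocab)
```

Building the n-gram tables on text mapped this way gives the model counts for contexts that contain `<unk>`, so an unseen input word no longer forces an immediate back-off to shorter contexts.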

Positive aspects:

  • Code simplicity and interpretability
  • Low resource requirement and speed