Capstone Project: Natural Language Processing

Asier Goikoetxea
2016/10/10

1- Introduction

The goal of this project is to explore different NLP models and to build a practical application that predicts the next word of a sentence. The project is divided into several phases:

Understanding the problem and the Data
- Exploration
Model Selection
Developing a Shiny App
Evaluation

2- Exploration

Three files with raw data:
- Twitter, news articles and blogs.
Sample the data
Clean the data:
- convert the text to lowercase, remove punctuation, remove stopwords, remove bad words.
Tokenize: create 1-gram, 2-gram, 3-gram and 4-gram tables.

     file      size   lines    words
1 twitter 316037600 2360148 30093410
2   blogs 260564320  899288 37546246
3    news 261759048 1010242 34762395
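
The cleaning and tokenization steps listed in this section can be sketched as follows. This is a minimal illustration in Python, not the code used in the project: the `STOPWORDS` and `BAD_WORDS` sets, the function names `clean` and `ngram_table`, and the two sample sentences are all hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical placeholder lists; the real lists are not given in the slides.
STOPWORDS = {"the", "a", "an", "of", "to", "and"}
BAD_WORDS = {"darn"}

def clean(line):
    """Lowercase, strip punctuation, then drop stopwords and bad words."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", " ", line)  # keep only letters and whitespace
    return [w for w in line.split() if w not in STOPWORDS and w not in BAD_WORDS]

def ngram_table(token_lists, n):
    """Count every run of n consecutive tokens, line by line."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up sample lines standing in for the sampled corpus.
sample = ["The cat sat on the mat.", "A cat sat on a darn chair!"]
cleaned = [clean(line) for line in sample]

# One frequency table per n-gram order, 1 through 4.
tables = {n: ngram_table(cleaned, n) for n in (1, 2, 3, 4)}
```

N-grams are counted within each line, so no n-gram spans two unrelated lines.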

3- Model Selection

Different options for Back-Off Models:

Katz Back-off: Good-Turing discounting
Kneser-Ney discounting
- Most complex logic
- Best accuracy
Stupid Back-Off (fixed back-off factor of 0.4)
- Simplest option, with low computing requirements
- Similar results to Kneser-Ney with a very large corpus
  - More information: The Unreasonable Effectiveness of Data
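
To make the Stupid Back-Off idea concrete, here is a minimal Python sketch under stated assumptions, not the project's implementation. If the full n-gram was seen, the score is its relative frequency. Otherwise the first context word is dropped and the score is multiplied by the fixed factor 0.4, down to the unigram frequency. The names `build_counts` and `stupid_backoff` and the toy sentence are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off factor

def build_counts(tokens, max_n=4):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total):
    """Score `word` after `context`, a tuple of the preceding words."""
    factor = 1.0
    while context:
        full = counts[context + (word,)]
        if full > 0:
            # Seen n-gram: relative frequency, scaled by accumulated back-offs.
            return factor * full / counts[context]
        context = context[1:]  # drop the earliest context word
        factor *= ALPHA        # penalize each back-off step
    # No context left: fall back to the unigram relative frequency.
    return factor * counts[(word,)] / total

# Toy corpus, made up for illustration.
tokens = "i like green tea i like black tea i drink green tea".split()
counts = build_counts(tokens)
total = len(tokens)

seen = stupid_backoff("like", ("i",), counts, total)                # bigram seen
backed = stupid_backoff("tea", ("drink", "black"), counts, total)   # one back-off
unigram = stupid_backoff("tea", ("purple",), counts, total)         # unigram fallback
```

The scores are not normalized probabilities, which is what keeps the method cheap: no discounting or renormalization pass over the tables is needed.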

4- Final App

The app consists of two columns: the left column takes the user input and the right one shows the output. My Shiny App Link

Inputs: a text field to type the text and a button to execute the model and guess the word.
Outputs:
- The single word with the highest probability
- The top 5 words with the highest probability
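
Both outputs can be derived from a single ranking of candidate words by score. The sketch below is a hypothetical Python illustration, not the app's code: the function name `top_predictions` and the scores are invented for the example.

```python
import heapq

def top_predictions(scores, k=5):
    """Return the k highest-scoring (word, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up candidate scores for illustration only.
scores = {"tea": 0.50, "coffee": 0.20, "water": 0.12,
          "milk": 0.08, "juice": 0.06, "soda": 0.04}

top5 = top_predictions(scores)  # the top 5 words
best_word = top5[0][0]          # the single best word
```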

5- Evaluation

Limitations:

App Accuracy:
- The Stupid Back-Off model needs a very large corpus to reach competitive accuracy, and I could not use one that size in my demo.
- Unknown-word handling could be improved

Positive aspects:

Code simplicity and interpretability
Low resource requirement and speed