Coursera Data Science Capstone Project

Huang-Ming Chang
January 14th, 2016

The purpose of this project is to apply the knowledge and techniques gained from the previous courses to Natural Language Processing (NLP). The outcome of the project can be accessed at the following link: https://hmchang.shinyapps.io/DatascienceCapstoneApp/

N-grams

In the field of NLP, a commonly used approach is N-grams. The idea is to split the text into sequences of n consecutive words. We then train a model on these sequences and use it to predict the next word given a sequence of text or speech.
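As a minimal sketch of the tokenization step, the Python snippet below splits a sentence into words and lists its 2-grams and 3-grams. The function names `tokenize` and `ngrams` and the sample sentence are my own illustrative choices, not the code used in the app.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, used only for illustration.
tokens = tokenize("Last year I was in Paris")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sentence of six words yields five 2-grams and four 3-grams, since each N-gram starts one word later than the previous one.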

In this exercise, I generated three dictionaries, consisting of 2-grams, 3-grams, and 4-grams. This means that the app takes into account up to the last three words of a sentence when predicting the next word.
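One possible layout for such dictionaries, assuming each one maps the first n-1 words of an N-gram to counts of the word that follows, is sketched below in Python. The function `build_dictionary` and the tiny corpus are hypothetical and only illustrate the idea; the app's actual data structures may differ.

```python
from collections import Counter, defaultdict

def build_dictionary(tokens, n):
    """Map each (n-1)-word prefix to a Counter of the words seen right after it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[prefix][next_word] += 1
    return table

# Hypothetical toy corpus; a real model would be trained on far more text.
corpus = "last year i was here and last year i was there".split()

# One dictionary per N-gram order: 2-gram, 3-gram, and 4-gram.
dictionaries = {n: build_dictionary(corpus, n) for n in (2, 3, 4)}
```

With this layout, looking up the last three typed words in the 4-gram dictionary returns the counts of every word that followed them in the training text.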

Back-off approach

A common approach for predicting the next word is the back-off model. The principle is to make predictions with N-gram models in descending order of N. The approach starts with the largest N-gram model (in our case, N = 4) and feeds the last N-1 words of the given sentence to it in order to predict the next word. If no valid prediction is made, we back off to the (N-1)-gram model, using the last N-2 words of the sentence, and so on until a valid prediction is generated or the smallest model has been tried. The following pseudocode represents the back-off approach.

int n = Max(N)

do {
    prediction = ngram(n, getLastWords(n - 1))
    n = n - 1
} while (prediction is not valid && n >= 2)
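The back-off loop can also be sketched in runnable form. The Python version below assumes the dictionaries map an (n-1)-word prefix to counts of following words; `predict_next` and the toy table `toy` are hypothetical names and data, not the app's implementation. It returns None where the app would show NULL.

```python
from collections import Counter

def predict_next(words, dictionaries, max_n=4):
    """Try the largest N-gram table first, then back off to smaller ones."""
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this N-gram order
        prefix = tuple(words[-(n - 1):])
        candidates = dictionaries.get(n, {}).get(prefix)
        if candidates:
            # Return the most frequent follower of this prefix.
            return candidates.most_common(1)[0][0]
    return None  # no table produced a valid prediction

# Hypothetical toy tables, keyed by N-gram order.
toy = {
    4: {("last", "year", "i"): Counter({"was": 2})},
    3: {("year", "i"): Counter({"was": 2})},
    2: {("i",): Counter({"was": 2, "am": 1})},
}

full_match = predict_next(["last", "year", "i"], toy)
backed_off = predict_next(["hello", "i"], toy)
no_match = predict_next(["last"], toy)
```

In this toy setup, the first call is answered by the 4-gram table, the second backs off to the 2-gram table because "hello i" was never seen, and the third finds nothing in any table.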

The Usage Of The Application

The app can be accessed through the following link:

https://hmchang.shinyapps.io/DatascienceCapstoneApp/

This app provides one text input and two display fields. The user types into the text input, and the first display field shows the whole sentence. Based on the given text, the app predicts the next word and shows it in the bottom field. If the app is unable to make a valid prediction, it shows NULL.

The current app, however, does not perform well. You have to type several words in order to obtain a valid prediction. For example, typing “last” gives a NULL result, but typing “last year I” starts to give useful predictions, e.g. “was”.

Application Screenshot

Limitations

Due to time constraints, this app is not able to handle the following:

  • non-English words
  • misspellings
  • context recognition