Coursera Data Science Capstone Project

d2ski
Sun Apr 26 13:41:41 2015

Final report

'Next Word Prediction' application

Executive summary

  • The goal of the Capstone project was to develop an algorithm for the 'Next word prediction' problem and implement it as a Shiny web application

  • The problem was solved with an N-Gram model from Natural Language Processing

  • A 3-Gram 'stupid' backoff model was used in the final implementation

  • The link for the final web app

Why 3-Gram 'stupid' backoff?

  • A common approach to NLP problems of this kind is the N-Gram model:
    – many available NLP resources propose this model as the most suitable for this problem

  • Linear interpolation and backoff were considered and tested first because of the project's goal and constraints:
    – The goal is accuracy of the next word prediction (not the perplexity of the sentence)
    – Other methods, such as Kneser-Ney smoothing, are more computationally complex and require more memory (the free Shinyapps plan limits these resources)
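For reference, the two candidate schemes can be written in their usual textbook form for a trigram model. This is a sketch of the standard definitions, not the exact formulas used in the project: the interpolation weights and the backoff factor are not stated in this report, and the symbols (c for an n-gram count, λ for interpolation weights, α for the fixed backoff factor) are introduced here only for illustration.

```latex
% Linear interpolation: mix trigram, bigram and unigram estimates
\hat{P}(w_i \mid w_{i-2} w_{i-1}) =
  \lambda_1 P(w_i \mid w_{i-2} w_{i-1})
  + \lambda_2 P(w_i \mid w_{i-1})
  + \lambda_3 P(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

% 'Stupid' backoff: a relative-frequency score, not a normalized probability
S(w_i \mid w_{i-2} w_{i-1}) =
\begin{cases}
  \dfrac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}
    & \text{if } c(w_{i-2} w_{i-1} w_i) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
```

Because 'stupid' backoff only needs raw counts and one fixed factor, it avoids the discount and normalization terms that make methods such as Kneser-Ney smoothing heavier to compute and store.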

Why 3-Gram 'stupid' backoff? Tests

Models with different parameters were tested:

Model    Ngram    Ngram.counts  Stemmed.tokens  Accuracy
Linear   4-grams  >2            No              0.14
Linear   4-grams  >3            No              0.14
Linear   4-grams  >1            Yes             0.10
Linear   3-grams  >1            No              0.10
Backoff  3-grams  >0            No              0.20
Backoff  3-grams  >1            No              0.20

The 3-Gram 'stupid' backoff model had the best accuracy when predicting a single next word (0.20, i.e. 20%).

This is also consistent with Dan Jurafsky's statement in the Natural Language Processing MOOC that 'stupid' backoff models are best for web-scale N-Gram data.
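The report does not say how the accuracy figures were computed. A minimal sketch of one plausible measurement, top-1 accuracy over a held-out token sequence, is shown below. The function name `top1_accuracy`, the 2-word context and the toy predictor are all assumptions made for illustration.

```python
def top1_accuracy(test_tokens, predict):
    """Share of positions where the first predicted word equals the actual next word.

    `predict` is any callable that takes a list of context words and
    returns a list of candidate next words, best candidate first.
    """
    hits = 0
    total = 0
    # Start at index 2 so that every position has a 2-word context.
    for i in range(2, len(test_tokens)):
        context = test_tokens[i - 2:i]
        candidates = predict(context)
        if candidates and candidates[0] == test_tokens[i]:
            hits += 1
        total += 1
    return hits / total if total else 0.0


# Toy usage: a hypothetical predictor that always answers "the".
def always_the(context):
    return ["the"]


demo_accuracy = top1_accuracy(["a", "b", "the", "c", "the"], always_the)
print(round(demo_accuracy, 2))  # 2 of 3 positions match -> 0.67
```

Under this definition, an accuracy of 0.20 means the top suggestion matched the actual next word in one of every five test positions.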

The algorithm

  • If trigrams with the last 2 entered words as prefix are found, the 3 most frequent trigram suffixes are returned as predicted words (ordered by frequency)
  • If there are no such trigrams, the suffixes of the most frequent bigrams with the last entered word as prefix are returned similarly (ordered by frequency)
  • Otherwise, the 3 most frequent unigrams are returned (ordered by frequency)
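The steps above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the code of the actual app: the function names `build_tables` and `predict_next` and the toy corpus are hypothetical, and the bigram step is assumed to be keyed on the single last entered word.

```python
from collections import Counter, defaultdict


def build_tables(tokens):
    """Count unigrams, plus bigram and trigram suffixes grouped by their prefix."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)   # last word -> counts of following words
    trigrams = defaultdict(Counter)  # (word1, word2) -> counts of following words
    for i in range(len(tokens) - 1):
        bigrams[tokens[i]][tokens[i + 1]] += 1
    for i in range(len(tokens) - 2):
        trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return unigrams, bigrams, trigrams


def predict_next(words, unigrams, bigrams, trigrams, k=3):
    """Return up to k candidates, backing off trigram -> bigram -> unigram."""
    if len(words) >= 2:
        suffixes = trigrams.get((words[-2], words[-1]))
        if suffixes:
            return [w for w, _ in suffixes.most_common(k)]
    if len(words) >= 1:
        suffixes = bigrams.get(words[-1])
        if suffixes:
            return [w for w, _ in suffixes.most_common(k)]
    # No matching trigram or bigram prefix: fall back to the most frequent words.
    return [w for w, _ in unigrams.most_common(k)]


# Toy corpus, for illustration only.
demo_tokens = "i love you i love you i love cats you are here".split()
uni, bi, tri = build_tables(demo_tokens)

print(predict_next(["i", "love"], uni, bi, tri))    # trigram hit -> ['you', 'cats']
print(predict_next(["maybe", "you"], uni, bi, tri))  # bigram backoff -> ['i', 'are']
print(predict_next(["foo", "bar"], uni, bi, tri))    # unigram fallback
```

Because only the highest-order match is used and candidates are ranked by raw count within that order, no backoff factor is needed to pick the top 3 words.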

How to use the app?

The link for the final web app

– It predicts the 1 most likely next word (the default option in the drop-down list)
– Additionally, it lets you choose other predicted words from the drop-down list
– It also predicts the end of the sentence (“.”, “!”, “?” are treated as next word predictions)

To use this app:
– Enter a sentence or phrase in the top text field
– Select the predicted next word from the drop-down list, or keep the default most likely next word
– Press the 'Submit' button to complete the phrase with the predicted word