d2ski
Sun Apr 26 13:41:41 2015
The goal of the Capstone project was to develop an algorithm for the 'next word prediction' problem and to implement it as a Shiny web application.
The problem was solved with a Natural Language Processing N-Gram model, which estimates the most likely next word from the preceding N-1 words.
A 3-Gram 'stupid' backoff model was used in the final implementation.
The link to the final web app
A common approach to this kind of NLP problem is the N-Gram model:
– many available NLP resources propose this model as the most suitable one for this problem
Linear interpolation and backoff were considered and tested first, given the project's goal and constraints (both scoring schemes are sketched after this list):
– The goal is next-word prediction accuracy (not sentence perplexity)
– other methods, such as Kneser-Ney smoothing, are more computationally complex and require more memory (the free Shinyapps plan restricts both)
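For reference, the two scoring schemes read as follows. These are the standard textbook formulations, not formulas taken from the project's code; the weights and the backoff penalty are assumptions:

```latex
% Linear interpolation: mix trigram, bigram and unigram estimates
% with weights \lambda_i that sum to 1 (tuned on held-out data).
P_{\mathrm{interp}}(w_i \mid w_{i-2}, w_{i-1}) =
    \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1})
  + \lambda_2\, P(w_i \mid w_{i-1})
  + \lambda_1\, P(w_i)

% 'Stupid' backoff: relative frequency of the highest-order N-gram
% observed, otherwise back off with a fixed penalty \alpha
% (\alpha = 0.4 in Brants et al., 2007).
S(w_i \mid w_{i-2}, w_{i-1}) =
  \begin{cases}
    \dfrac{c(w_{i-2}\, w_{i-1}\, w_i)}{c(w_{i-2}\, w_{i-1})}
      & \text{if } c(w_{i-2}\, w_{i-1}\, w_i) > 0,\\[1ex]
    \alpha \cdot S(w_i \mid w_{i-1})
      & \text{otherwise.}
  \end{cases}
```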
Models with different parameter settings were tested:
| Model | N-gram order | Count threshold | Stemmed tokens | Accuracy (top-1) |
|---|---|---|---|---|
| Linear | 4-grams | > 2 | No | 0.14 |
| Linear | 4-grams | > 3 | No | 0.14 |
| Linear | 4-grams | > 1 | Yes | 0.10 |
| Linear | 3-grams | > 1 | No | 0.10 |
| Backoff | 3-grams | > 0 | No | 0.20 |
| Backoff | 3-grams | > 1 | No | 0.20 |
The 3-Gram 'stupid' backoff model had the best accuracy for single next-word prediction (20%); a sketch of the lookup follows below.
This also agrees with Dan Jurafsky's statement in the Natural Language Processing MOOC that 'stupid' backoff models work well for web-scale N-Gram data.
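A minimal sketch of such a backoff lookup, assuming pre-computed count tables `tri`, `bi`, and `uni` (hypothetical names and layout; the app's actual code may differ):

```r
# Hypothetical count tables:
#   tri, bi: data frames with columns history, word, count
#   uni:     data frame with columns word, count
# Returns up to k candidate next words, trying the trigram table first
# and backing off to lower orders when the history is unseen.
predict_next <- function(tokens, tri, bi, uni, k = 1) {
  hist2 <- paste(tail(tokens, 2), collapse = " ")  # last two words
  hits <- tri[tri$history == hist2, ]
  if (nrow(hits) == 0) {                           # back off to bigrams
    hist1 <- tail(tokens, 1)
    hits <- bi[bi$history == hist1, ]
  }
  if (nrow(hits) == 0) {                           # back off to unigrams
    hits <- uni
  }
  hits <- hits[order(-hits$count), ]               # most frequent first
  head(hits$word, k)
}
```

Within a single backoff level the fixed penalty does not change the ranking, so this greedy sketch omits it; a fuller implementation would discount lower-order scores (by 0.4 per level in Brants et al., 2007) when mixing candidates across levels.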
The link to the final web app
– It predicts the single most likely next word (the default option in the drop-down list)
– Additionally, it lets the user choose among several candidate words from the drop-down list
– It also predicts the end of a sentence (“.”, “!”, and “?” are treated as next-word candidates)
To use this app:
– Enter a sentence or phrase in the top text field
– Select a predicted next word from the drop-down, or keep the default most likely prediction
– Press the 'Submit' button to complete the phrase with the predicted word (a minimal Shiny sketch of this workflow follows)
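The steps above map onto a small Shiny skeleton. The sketch below is an illustration under stated assumptions, not the app's actual source: it reuses the hypothetical `predict_next()` and count tables `tri`, `bi`, `uni` from the earlier sketch.

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Sentence or phrase"),
  selectInput("next_word", "Predicted next word", choices = character(0)),
  actionButton("submit", "Submit"),
  textOutput("completed")
)

server <- function(input, output, session) {
  # Re-predict whenever the input text changes; ".", "!" and "?"
  # may appear among the candidates as end-of-sentence predictions.
  observeEvent(input$phrase, {
    tokens <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    choices <- predict_next(tokens, tri, bi, uni, k = 5)
    updateSelectInput(session, "next_word", choices = choices)
  })
  # Append the selected word to the phrase on 'Submit'.
  observeEvent(input$submit, {
    output$completed <- renderText(paste(input$phrase, input$next_word))
  })
}

shinyApp(ui, server)
```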