Word Predictor Application

Caio Miyashiro
12/06/2014

An application to predict the next word given the previous ones

Algorithm Description

In a previous analysis [1], we preprocessed and analyzed a corpus consisting of 3 datasets. Based on that analysis, we decided to keep a subset of the corpus and use an n-gram model for prediction.

We used 50,000 sentences from each dataset and created 4 matrices, one for each n-gram size from 1 to 4. Each entry of an n-gram matrix holds the first n-1 words as its value, and the name of its index is the last word.

[1] http://rpubs.com/caiomiyashiro/dataScienceCaps0/
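
As a rough illustration of the n-gram tables described above, here is a minimal Python sketch. It is not the original implementation: the function name build_ngram_counts, the whitespace tokenization and the (context, last word) key layout are all assumptions made for this example.

```python
from collections import Counter

def build_ngram_counts(sentences, max_n=4):
    """Count n-grams for n = 1..max_n.

    Hypothetical layout: each key is (context, last_word), where context
    is a tuple with the first n-1 words of the n-gram.
    """
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                context, last_word = tuple(gram[:-1]), gram[-1]
                tables[n][(context, last_word)] += 1
    return tables

# Toy corpus, invented for illustration only
sentences = ["it would mean the world", "it would mean the world to me"]
tables = build_ngram_counts(sentences)
```

With this toy corpus, the 4-gram table maps the context ("would", "mean", "the") and last word "world" to a count of 2, since that 4-gram occurs once in each sentence.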

Algorithm Description II

Efficiency Concerns

Due to RAM limitations, we kept only the n-grams whose frequency exceeded a specified threshold, in this case those occurring more than 5 times.
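
The pruning step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name prune and the toy counts are invented, and the real tables may be stored differently.

```python
from collections import Counter

def prune(counts, min_count=5):
    """Keep only the n-grams seen strictly more than min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c > min_count})

# Toy counts, invented for illustration only
counts = Counter({
    ("the", "world"): 12,
    ("the", "end"): 6,
    ("the", "sea"): 5,    # dropped: 5 is not greater than 5
    ("the", "wolrd"): 1,  # dropped: rare, probably a typo
})
pruned = prune(counts)
```

Besides saving memory, a threshold like this also tends to discard rare misspellings, at the cost of losing legitimate but infrequent n-grams.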

Word Prediction

Given a phrase, the algorithm searches the 4-gram matrix for a possible next word (stored as the index name). If no prediction is found, the algorithm backs off and searches the 3-gram matrix, and so on, down to the 1-gram matrix. At that point we cannot decide which word comes next, so we use a string similarity algorithm [2] to try to correct any misspelling in the last word.

[2] http://www.r-bloggers.com/the-stringdist-package/
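
The back-off search described above could look roughly like the following Python sketch. Several things here are assumptions: the tables are laid out as context -> Counter of next words, the names predict, tables and vocabulary are hypothetical, and Python's standard difflib stands in for the stringdist package cited in [2]. The sketch also assumes that, when no context matches, the single returned word is the closest vocabulary match to the last typed word.

```python
from collections import Counter
from difflib import get_close_matches

def predict(phrase, tables, vocabulary, k=3):
    """Return up to k candidate words, backing off from 4-grams to 2-grams."""
    tokens = phrase.lower().split()
    for n in (4, 3, 2):
        if len(tokens) < n - 1:
            continue  # phrase too short for this n-gram size
        context = tuple(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    if not tokens:
        return []
    # No context matched: look for the known word closest to the last
    # typed word (difflib is a stand-in for stringdist, as an assumption).
    return get_close_matches(tokens[-1], vocabulary, n=1, cutoff=0.6)

# Toy tables and vocabulary, invented for illustration only
tables = {
    4: {("would", "mean", "the"): Counter({"world": 3, "most": 1})},
    3: {("mean", "the"): Counter({"world": 3, "same": 2})},
    2: {("the",): Counter({"world": 5, "end": 4, "same": 2, "most": 1})},
}
vocabulary = ["the", "world", "mean", "would", "please"]

print(predict("It would mean the", tables, vocabulary))
print(predict("follow me the", tables, vocabulary))
print(predict("follow me plese", tables, vocabulary))
```

The first call is answered by the 4-gram table, the second one backs off to the 2-gram table, and the third one finds no matching context, so it falls through to the similarity lookup on "plese".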

Shiny App

A practical application can be seen on this Shiny page [3]. On the left there is a text input, where the user can enter a sentence without its last word. After the user presses the submit button, the algorithm finds the 3 most probable words that fit the phrase. If the algorithm finds no match before reaching the 1-gram matrix, it predicts only 1 word, checking for possible misspellings in the last word.

An example: For “You're the reason why I smile everyday. Can you follow me please? It would mean the”, the algorithm correctly predicts the word “world”.

[3] http://caiomiyashiro.shinyapps.io/CapstoneShiny/

Final Considerations

Learning NLP from the ground up was challenging and fun, and putting all the specialization knowledge together in one last project was really exciting.

Update: Coursera users, including me, have been having problems with the Shiny server (https://class.coursera.org/dsscapstone-002/forum/thread?thread_id=310). If you can't open my Shiny project page, please run it locally. The code is small, so there should be no problem running it on your local machine. All the files can be obtained here: https://www.dropbox.com/sh/ktxs7lop0but4ky/AACQWkZlJbDl9ojKpv43oiUNa?dl=0

Thank you. =D