Sora
July 8, 2016
This is an algorithm that predicts the next word based on the user's text input.
It is a project from the Coursera Data Science program by Johns Hopkins University. A Shiny application was built to test the accuracy of the algorithm.
We use data sampled from Twitter, News and Blogs. Data from Twitter and News make up about 80% of the sample, as their language is closer to everyday conversation.
The basic model is built on n-gram dictionaries. To build it:
Step 1: clean the sample (e.g. remove numbers and non-English words, convert everything to lowercase, etc.)
Step 2: break the sample into 1-, 2-, 3- and 4-gram sets
Step 3: build a function that takes the user's input and finds matches in the n-gram sets
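Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration in Python, not the project's own code; the function names clean and build_ngrams are hypothetical, and the cleaning rule (keep only runs of the letters a to z) is a simplified stand-in for real non-English word filtering.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep only runs of the letters a-z.

    Assumption: this drops digits, punctuation and non-ASCII characters,
    which is only a crude approximation of removing non-English words.
    """
    return re.findall(r"[a-z]+", text.lower())

def build_ngrams(tokens, max_n=4):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

sample = "I love NLP 2016! I love data science."
tokens = clean(sample)
ngrams = build_ngrams(tokens)

print(tokens)                     # ['i', 'love', 'nlp', 'i', 'love', 'data', 'science']
print(ngrams[2][("i", "love")])   # 2
```

Storing counts rather than bare sets keeps the frequency information needed later to pick the most likely next word.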
When the algorithm cannot find a next word based on the input, it performs a “downgrade” and searches for the word in the (n-1)-gram set.
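The “downgrade” step resembles what is commonly called back-off. A minimal sketch in Python is shown here, under the assumption that downgrading means dropping the oldest word of the input and retrying with the shorter context; the names build_table and predict are hypothetical and not taken from the project.

```python
from collections import Counter, defaultdict

def build_table(tokens, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    table = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            table[tuple(context)][nxt] += 1
    return table

def predict(table, words, max_n=4):
    """Try the longest usable context first; on a miss, drop the oldest word and retry."""
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        if context in table:
            # Most frequent continuation seen after this context.
            return table[context].most_common(1)[0][0]
        if not context:
            return None  # nothing known at all (empty training data)
        context = context[1:]  # the "downgrade" to the (n-1)-gram

tokens = "i love data science and i love data science too and i love coffee".split()
table = build_table(tokens)

print(predict(table, ["we", "love", "data"]))  # science
print(predict(table, ["really", "love"]))      # data
```

In the first call the three-word context is unknown, so the lookup falls back to the two-word context "love data". The empty context holds single-word counts, so a completely unknown input still returns a frequent word.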
If a user types input that is not contained in the sample sets, the input is added to the corresponding n-gram dataset.
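One way this update could look is sketched below in Python. It assumes that the "corresponding" dataset is the one whose n equals the number of words typed, and that a new entry starts with a count of 1; both are assumptions, and add_if_unseen is a hypothetical name.

```python
from collections import Counter

def add_if_unseen(ngram_sets, words):
    """Add the typed words to the n-gram set of matching length if absent.

    Returns True if a new entry was added, False if it was already known.
    """
    key = tuple(w.lower() for w in words)
    counts = ngram_sets.setdefault(len(key), Counter())
    if key in counts:
        return False
    counts[key] = 1  # assumption: new entries start with a count of 1
    return True

ngram_sets = {2: Counter({("i", "love"): 2})}

print(add_if_unseen(ngram_sets, ["Hello", "world"]))  # True
print(add_if_unseen(ngram_sets, ["Hello", "world"]))  # False
print(ngram_sets[2][("hello", "world")])              # 1
```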
Important: please give the app a few seconds to search.
How to use it:
First select the number of words that the input contains, then type in the input. The app will return a prediction of what the next word might be.