Martin Pons
12/13/2014
A variant of the n-gram algorithm has been developed.
How does it work?
The algorithm was trained on three corpora from three different sources: blogs, Twitter and news.
Training used tokenized versions of these corpora. Joint frequency tables were obtained, for example for four-word sequences:
  word1 word2  word3 word4 freq       rel
1   the   end     of   the 3388 0.0008199
2   the  rest     of   the 2992 0.0007240
3    at   the    end    of 2486 0.0006016
4    is going     to    be 2367 0.0005728
5    is   one     of   the 1886 0.0004564
6    in   the middle    of 1873 0.0004532
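As a rough illustration of how such a table could be built, here is a minimal Python sketch (the original work was done in R, and this is not the author's code). The names `tokenize` and `ngram_table` and the toy corpus are hypothetical. It also assumes that `rel` is the count of a four-word sequence divided by the total number of four-word sequences, which the source does not state.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def ngram_table(sentences, n=4):
    # Count every run of n consecutive tokens in each sentence.
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # Rows of (n-gram, freq, rel), most frequent first.
    # Assumption: rel = freq / total number of n-grams.
    return [(gram, freq, freq / total) for gram, freq in counts.most_common()]

# Toy corpus, made up for this example.
toy_corpus = [
    "at the end of the day",
    "the end of the road",
    "this is the end of the story",
]
table = ngram_table(toy_corpus)
for gram, freq, rel in table[:3]:
    print(" ".join(gram), freq, round(rel, 4))
```

On the real corpora the same counting would run over millions of tokens, so a production version would likely need to process the text in chunks.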
Frequency tables were restructured as trees (lists of lists in R). Thanks to this, the computational cost (in terms of user waiting time) is minimal.
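The tree idea can be sketched with nested dictionaries in Python, which play a role similar to R's lists of lists. This is an illustration only: `build_tree` and `candidates` are hypothetical names, and the exact layout of the author's tree is not given in the source. The rows are taken from the table above.

```python
def build_tree(rows):
    # rows: iterable of (words, freq), all n-grams of the same length.
    # Each word of the context is one level of the tree; the last level
    # maps the final word to its frequency.
    tree = {}
    for words, freq in rows:
        node = tree
        for word in words[:-1]:
            node = node.setdefault(word, {})
        node[words[-1]] = node.get(words[-1], 0) + freq
    return tree

def candidates(tree, context):
    # Walk down one level per context word. The context must have
    # exactly n - 1 words. Returns {} if the context was never seen.
    node = tree
    for word in context:
        node = node.get(word)
        if node is None:
            return {}
    return node

rows = [
    (("the", "end", "of", "the"), 3388),
    (("the", "rest", "of", "the"), 2992),
    (("at", "the", "end", "of"), 2486),
    (("is", "going", "to", "be"), 2367),
    (("is", "one", "of", "the"), 1886),
    (("in", "the", "middle", "of"), 1873),
]
tree = build_tree(rows)
print(candidates(tree, ("is", "going", "to")))
```

A lookup touches one level per context word instead of scanning the whole table, which is one plausible reason the waiting time stays low.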
Prediction:
An application with a simple user interface has been developed. This is how it works:
1- The user types a phrase
2- The user clicks the “Predict” button
3- The application returns the most likely next word predicted by the algorithm
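The prediction step behind these three actions might look like the Python sketch below. It is not the application's actual code: `build_model` and `predict` are hypothetical names, the corpus is a toy example, and the back-off to shorter contexts when the last three words were never seen is an assumption, since the source does not say how the variant handles unseen phrases.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    # model[context] counts the words seen right after that context,
    # for contexts of 1 to max_n - 1 words.
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, phrase, max_n=4):
    tokens = phrase.lower().split()
    # Try the longest context first, then back off to shorter ones
    # (assumed behaviour, not confirmed by the source).
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # nothing known about this phrase

# Toy corpus, made up for this example.
toy_corpus = [
    "it is going to be fine",
    "she is going to be late",
    "he is going to win",
]
model = build_model(toy_corpus)
print(predict(model, "I think it is going to"))
```

In the application, the phrase typed by the user would be passed to something like `predict` when the “Predict” button is clicked.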