Spiros Paraskevas
Hello. The capstone project was about building an algorithm that, based on user input, displays a meaningful next word (within a shiny app).
Uses are numerous (SMS writing on mobile phones, web apps with a search engine). Faster writing and suggested ideas while searching are the obvious benefits.
NOTE: The whole dataset was used, not samples of it, on a laptop with 4GB of RAM.
The preprocessing steps were:
Data cleaning (removal of numbers, profanity, and words containing characters outside the US English alphabet; spelling checks)
Frequencies of unique words were computed, and only the words that together accounted for 90% of the text were retained (~5000 unique words per dataset: twitter, blogs, news).
The filtered text was further processed in chunks (to compensate for the 4GB of available RAM) to find frequent 2-grams, 3-grams, 4-grams and 5-grams.
The resulting n-grams were combined in data tables and split into individual words, with the respective frequency (and probability) attached to each.
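
The steps above can be illustrated with a small sketch. This is not the project's actual code (the app was built with shiny, and the original implementation is not shown here); it is a minimal Python illustration under stated assumptions. All function names (coverage_vocab, count_ngrams_in_chunks, build_table, predict_next), the toy corpus, and the chunk size are hypothetical. It also assumes the simplest possible prediction rule, picking the most probable next word for an exact prefix match, with no smoothing or backoff across n-gram orders, since the text does not say how those were handled.

```python
from collections import Counter, defaultdict
from itertools import islice


def coverage_vocab(tokens, coverage=0.90):
    """Keep the most frequent words until they cover `coverage` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab, running = set(), 0
    for word, n in counts.most_common():
        if running >= coverage * total:
            break
        vocab.add(word)
        running += n
    return vocab


def count_ngrams_in_chunks(lines, vocab, n, chunk_size=2):
    """Count n-grams while holding only `chunk_size` lines in memory at a time.

    `lines` can be any iterable, e.g. an open file, so the full text
    never has to be loaded at once (assumed motivation: limited RAM).
    """
    it = iter(lines)
    totals = Counter()
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for line in chunk:
            words = line.lower().split()
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Keep only n-grams made entirely of retained words.
                if all(w in vocab for w in gram):
                    totals[gram] += 1
    return totals


def build_table(ngram_counts):
    """Split each n-gram into (prefix, next word) and attach a probability."""
    by_prefix = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        by_prefix[gram[:-1]][gram[-1]] += count
    table = {}
    for prefix, nexts in by_prefix.items():
        total = sum(nexts.values())
        table[prefix] = {w: c / total for w, c in nexts.items()}
    return table


def predict_next(table, prefix):
    """Return the most probable next word for `prefix`, or None if unseen."""
    candidates = table.get(tuple(prefix))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy corpus (hypothetical), standing in for the twitter/blogs/news text.
lines = [
    "i love data science",
    "i love data tables",
    "i love data science a lot",
    "we love data science",
]
tokens = [w for line in lines for w in line.split()]
vocab = coverage_vocab(tokens, coverage=0.90)
trigrams = count_ngrams_in_chunks(lines, vocab, n=3)
table = build_table(trigrams)
prediction = predict_next(table, ["love", "data"])
print(prediction)
```

On this toy corpus the rare word "we" falls outside the 90% coverage cut, so any n-gram containing it is dropped, and the prefix ("love", "data") leads to "science" with probability 0.75. A fuller predictor would typically fall back from 5-grams to shorter n-grams when a prefix is unseen; that part is left out because the text does not describe it.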