This presentation covers the following:
- Creating the n-gram data set used to predict the next word in the Shiny App
- Knowledge captured during the creation of the data set and the Shiny App
- Next Word Prediction - why this specific approach
12/5/2021
The raw data contains corpora in 4 different languages; for this project, only the en_US locale files were downloaded.
Text mining was implemented using the tidy libraries. Because the Twitter and blog files are very large, only 33% of each was used in this project.
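As a rough illustration of taking a 33% subset of a large file, here is a minimal Python sketch. It is not the code used for this project (which relied on the R tidy libraries); the function name `sample_lines`, the `fraction` and `seed` parameters, and the choice of a fixed-size random sample are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.33, seed=42):
    """Return a random subset of `lines`, keeping the original order.

    `fraction` and `seed` are illustrative defaults, not values taken
    from the project beyond the 33% figure.
    """
    rng = random.Random(seed)            # fixed seed makes the sample reproducible
    k = round(len(lines) * fraction)     # number of lines to keep
    keep = sorted(rng.sample(range(len(lines)), k))  # chosen line indices, in file order
    return [lines[i] for i in keep]
```

Sampling by index and then sorting keeps the selected lines in their original file order, which is convenient when inspecting the subset by hand.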
Some details about the data:
The predictive text model was built from 500k lines extracted from a random sample of a large corpus of blog, news, and Twitter data (over 4 million lines).
The sample data was tokenized and cleaned with tidytext. Cleaning included removing profane words, removing all non-ASCII characters, and lower-casing all words. The strings were then split into 4-word tokens (n-grams with n = 4).
The Shiny App uses a data table of quadgrams (n-grams with n = 4, i.e. strings of 4 words) and their frequencies of occurrence to predict the next word.
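The cleaning and tokenizing steps described above can be sketched as follows. This is a plain Python approximation, not the tidytext pipeline used in the project: the helper names (`clean_tokens`, `quadgrams`, `count_quadgrams`), the word pattern, and the one-word `PROFANITY` placeholder are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical placeholder; the project's actual profanity list is not shown here.
PROFANITY = {"darn"}

def clean_tokens(line, profanity=PROFANITY):
    """Drop non-ASCII characters, lower-case, split into words, remove profanity."""
    text = line.encode("ascii", "ignore").decode("ascii").lower()
    words = re.findall(r"[a-z']+", text)   # assumed word pattern: letters and apostrophes
    return [w for w in words if w not in profanity]

def quadgrams(tokens):
    """Return every run of 4 consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3)]

def count_quadgrams(lines):
    """Build a quadgram -> frequency table from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(quadgrams(clean_tokens(line)))
    return counts
```

Lines with fewer than 4 words after cleaning contribute no quadgrams in this sketch.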
Users enter text in the text box, and the predicted words are shown below it.
Users can also choose how many words to predict by entering a number in the field below the text box.
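One way a quadgram frequency table could drive this kind of prediction is sketched below in Python. It is an assumption about the lookup logic, not the Shiny App's actual code: the function `predict_next`, its `n_words` parameter, and the rule of matching on the last three words typed are all hypothetical, and the sketch has no fallback for unseen contexts.

```python
from collections import Counter

def predict_next(counts, text, n_words=3):
    """Return up to `n_words` candidate next words, most frequent first.

    `counts` maps 4-word tuples to frequencies. The last three words of
    `text` are matched against the first three words of each quadgram.
    """
    context = tuple(text.lower().split()[-3:])
    candidates = Counter()
    for gram, freq in counts.items():
        if gram[:3] == context:          # quadgram starts with the typed context
            candidates[gram[3]] += freq  # accumulate frequency of the fourth word
    return [word for word, _ in candidates.most_common(n_words)]

# Toy frequency table, invented for the example.
toy_counts = Counter({
    ("thanks", "for", "the", "follow"): 5,
    ("thanks", "for", "the", "help"): 2,
    ("for", "the", "first", "time"): 7,
})
print(predict_next(toy_counts, "Thanks for the", n_words=2))
```

When the typed text has fewer than three words, or its last three words never appear in the table, this sketch returns an empty list.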