From a given dataset of English text, I created a smaller corpus and explored the structure of the data and how words are put together in a sentence. I cleaned and analyzed the text data, built and sampled from a predictive text model, and finally built a predictive text product as an RStudio Shiny app.
I started out with a huge dataset consisting of three text files (US.blogs, US.twitter, US.news), all in English. Because this dataset was too large for my CPU to handle, I created a smaller corpus by taking an equal percentage of words from each of the three files. The final corpus consisted of 370,730 blog words, 300,257 Twitter words and 351,572 news words.
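The sampling step can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the project, and it makes two assumptions that are not stated above: it samples whole lines instead of individual words, and it uses a fixed 5% fraction. The names sample_lines and sources are hypothetical, and the three lists are made-up stand-ins for the real files.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Return a random sample holding `fraction` of the given lines."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)

# Hypothetical stand-ins for the blogs, twitter and news files.
sources = {
    "blogs": [f"blog line {i}" for i in range(1000)],
    "twitter": [f"tweet {i}" for i in range(2000)],
    "news": [f"news line {i}" for i in range(500)],
}

# Apply the same fraction to every source so each is reduced proportionally.
corpus = {name: sample_lines(lines, fraction=0.05) for name, lines in sources.items()}

for name, lines in corpus.items():
    print(name, len(lines))
```

Using the same fraction for every source keeps the relative sizes of the three sources intact in the smaller corpus.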
I identified appropriate tokens such as words, punctuation, and numbers, and wrote a function that took a file as input and returned a tokenized version. I then cleaned the corpus so that only lowercase letters remained, removing profanity, extra whitespace, punctuation, numbers and other elements that I didn't want to predict.
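A cleaning and tokenizing function along these lines might look like the sketch below. It is written in Python for illustration, not taken from the project's R code. It works on a string rather than a file, the PROFANITY set is a tiny hypothetical placeholder, and keeping apostrophes inside words (so that "didn't" stays one token) is an assumption of this sketch.

```python
import re

# Hypothetical placeholder; a real profanity list would be far longer.
PROFANITY = {"darn", "heck"}

def tokenize(text, banned=PROFANITY):
    """Lowercase the text, drop digits and punctuation, and remove banned words."""
    text = text.lower()
    # Replace everything except letters, apostrophes and whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]

print(tokenize("Hello, World! 123 darn it's"))
```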
I performed a thorough exploratory analysis of the data to understand the distribution and frequencies of words and the relationships between words in the corpus. I visualized these with a word cloud and with histograms of the 1-, 2-, 3- and 4-grams.
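The frequency counts behind such histograms can be sketched like this. The example is in Python for illustration only (the project itself was done in R), the helper names ngrams and ngram_counts are hypothetical, and the sentence is a made-up toy input rather than data from the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ngram_counts(tokens, n):
    """Count how often each n-gram occurs."""
    return Counter(ngrams(tokens, n))

tokens = "the cat sat on the mat and the cat slept".split()

# Most frequent 1-, 2-, 3- and 4-grams, the raw material for a histogram.
for n in (1, 2, 3, 4):
    print(n, ngram_counts(tokens, n).most_common(2))
```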
Testing made it clear that the speed of predicting the next word was an issue. To solve this I used a less sophisticated but very robust statistical model: the trigram (or 2nd order Markov) model for language modeling. An n-gram model assumes that only the previous n-1 words affect the probability of the next word; for a trigram model (n = 3) that means the previous two words. Although this is an oversimplification, the model does an acceptable job of predicting the next word within an acceptable timeframe.
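To make the idea concrete, here is a minimal sketch of a trigram model that uses plain relative frequencies and no smoothing. It is written in Python for illustration and is not the app's R implementation. The function names and the toy sentence are hypothetical, and returning None when fewer than two words of context are available is an assumption of this sketch, not a description of how the app behaves.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Map each two-word context to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def probability(model, w1, w2, w3):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2 followed by any word)."""
    followers = model.get((w1, w2))
    if not followers:
        return 0.0
    return followers[w3] / sum(followers.values())

def predict_next(model, text):
    """Return the most frequent follower of the last two words, or None if unseen."""
    words = text.lower().split()
    if len(words) < 2:
        return None  # a pure trigram model needs two words of context
    followers = model.get(tuple(words[-2:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "i like green tea and i like green apples and i like green tea".split()
model = train_trigram(tokens)
print(predict_next(model, "I like green"))
print(probability(model, "like", "green", "tea"))
```

Because a prediction is a single dictionary lookup on the last two words, it stays fast even when the table of contexts is large, which fits the speed concern described above.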
For more information on the trigram (2nd order Markov) model, click here.
The app predicts the next word for a word or words entered. It is the end result of a process in which a corpus was created from a big dataset, tokenized, and filtered to remove unwanted words and elements. After an exploratory analysis, a trigram model was built, and a simple user interface shows the prediction result.
The user interface shows a container with the text: Enter your english text here. When one, two or three words are entered, the app predicts the next word using the trigram algorithm. The predicted word is shown in red in the main panel of the user interface, under the text: The predicted next word.
The app is built in RStudio Shiny. Click for the app.