Capstone

  • Author: Lex Knape
  • Date: January 2016
  • Capstone project for the JHU Data Science Specialization
  • Goal: Build a predictive text model

picture of the app

The Capstone project

From a given dataset of English text, I created a smaller corpus and explored the structure of the data and how words are put together in a sentence. I cleaned and analyzed the text data, built and sampled from a predictive text model, and finally built a predictive text product as an RStudio Shiny app.

The dataset

I started out with a huge dataset consisting of three text files (US.blogs, US.twitter, US.news), all in English. As this dataset was too large for my machine to process, I created a smaller corpus by taking an equal percentage of words from each of the three files. The final corpus consisted of 370,730 blog words, 300,257 Twitter words and 351,572 news words.
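The sampling step can be sketched as follows. This is a minimal illustration in Python, not the project's R code; it samples whole lines rather than individual words, and the `sample_lines` helper, the seed, the 10% fraction and the stand-in data are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction, seed=2016):
    """Return a random subset holding `fraction` of the given lines."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)     # same percentage for every source
    return rng.sample(lines, k)

# Hypothetical stand-ins for the blogs, twitter and news files.
sources = {
    "blogs": ["a line from a blog post"] * 1000,
    "twitter": ["a tweet"] * 1000,
    "news": ["a line from a news article"] * 1000,
}

# Apply the same fraction to each source to build the smaller corpus.
corpus = {name: sample_lines(lines, fraction=0.1) for name, lines in sources.items()}
```

Sampling by line keeps sentences intact, which matters later when word sequences are counted.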

Tokenize and clean the corpora.

I identified appropriate tokens such as words, punctuation, and numbers and wrote a function that took a file as input and returned a tokenized version. I cleaned the corpus so that only lower-case letters remained, removing profanity, extra whitespace, punctuation, numbers and other elements that I didn't want to predict.
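A cleaning and tokenizing function along these lines might look like the sketch below. It is written in Python for illustration and is not the project's R function; it takes a string rather than a file, keeps apostrophes inside words (an assumption), and uses a tiny placeholder `PROFANITY` set in place of a real profanity list.

```python
import re

# Hypothetical placeholder; a real profanity list would be much longer.
PROFANITY = {"darn", "heck"}

def clean_and_tokenize(text, banned=PROFANITY):
    """Lower-case the text, drop digits and punctuation, and return word tokens."""
    text = text.lower()
    # Replace everything except letters, apostrophes and whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [t.strip("'") for t in text.split()]
    # Drop empty tokens and any word on the banned list.
    return [t for t in tokens if t and t not in banned]

print(clean_and_tokenize("Hello, World! 42 darn   cats."))
```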

Exploratory analysis of the data

I performed a thorough exploratory analysis of the data so that I understood the distribution and frequencies of words and the relationships between words in the corpus. I visualized these with a word cloud and with histograms of the most frequent 1-, 2-, 3- and 4-grams.
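The frequency tables behind such histograms can be built by counting n-grams. The snippet below is a small Python sketch under that assumption, not the project's R code; `ngram_counts` and the sample sentence are made up for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields the n-grams as tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = "the cat sat on the mat and the cat slept".split()

# Show the three most frequent n-grams for n = 1 to 4.
for n in (1, 2, 3, 4):
    print(n, ngram_counts(tokens, n).most_common(3))
```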

picture of the app working

Description of the Algorithm

Some tests made it clear that the speed of predicting the next word was an issue. To solve this I used a less sophisticated but very robust statistical model: the trigram (or 2nd-order Markov) model for language modeling. An n-gram model assumes that only the previous n-1 words affect the probability of the next word, so the trigram model (n = 3) looks only at the previous two words. Although this is an oversimplification, the model does an acceptable job of predicting the next word within an acceptable time frame.

For more information on the trigram (2nd-order Markov) model, click here.
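As an illustration of the idea, a trigram model can estimate the probability of a word as count(w1 w2 w3) / count(w1 w2) and predict the most frequent follower of the last two words. The Python sketch below shows this plain maximum-likelihood version; it is not the project's R implementation, and the function names and toy sentence are assumptions.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Map each pair of consecutive words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def next_word_probability(model, w1, w2, w3):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    followers = model.get((w1, w2))
    if not followers:
        return 0.0
    return followers[w3] / sum(followers.values())

def predict(model, w1, w2):
    """Return the most frequent follower of (w1, w2), or None if the pair is unseen."""
    followers = model.get((w1, w2))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "i like green tea and i like green apples and i like green tea".split()
model = train_trigram(tokens)
print(predict(model, "like", "green"))
```

Because prediction is a single dictionary lookup, this kind of model stays fast even on a large corpus, which fits the speed concern described above.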

App description

The app predicts the next word after a word or words entered by the user. It is the end result of the steps above: a corpus was created from a big dataset, tokenized, and stripped of words and elements that were not needed. After an exploratory analysis, a trigram model was built, and a simple user interface shows the prediction result.

The user interface shows a container with the text: Enter your english text here. After the user enters one, two or three words, the app predicts the next word with the trigram algorithm. The predicted word is shown in red in the main panel of the user interface, under the text: The predicted next word.
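How the entered text might be turned into a prediction can be sketched as below. This Python example is an assumption-laden illustration, not the app's R code: it uses the last two words for the trigram lookup and, as an assumed fallback for one-word or unseen input, backs off to bigram counts. The names `build_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """Build follower counts keyed by one word (bigram) and by two words (trigram)."""
    bigram = defaultdict(Counter)
    trigram = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram[w1][w2] += 1
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram[(w1, w2)][w3] += 1
    return bigram, trigram

def predict_next(text, bigram, trigram):
    """Predict the next word from the last one or two words of the entered text."""
    words = text.lower().split()
    # Prefer the trigram table when at least two words were entered.
    if len(words) >= 2 and tuple(words[-2:]) in trigram:
        return trigram[tuple(words[-2:])].most_common(1)[0][0]
    # Assumed fallback: use only the last word.
    if words and words[-1] in bigram:
        return bigram[words[-1]].most_common(1)[0][0]
    return None  # nothing entered, or the words were never seen

tokens = "thank you very much and thank you very much indeed".split()
bigram, trigram = build_tables(tokens)
print(predict_next("thank you", bigram, trigram))
```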

Finally: the app

The app is built in RStudio Shiny. Click for the app.