Data Science Capstone Project

Philip Mayfield
1/22/2018

Predictive text using Markov Chain prediction methods.

Goal

The goal of this project is to create a Shiny application that predicts the next word in a sentence. The source data comes from three sources:

  • Twitter Data
  • Blogs
  • News stories

All of the text placed together in one data structure forms a “corpus”.

From this corpus, I will create an algorithm that predicts the next word in a sequence.

Algorithm

The source text (Twitter, blogs, and news) was cleaned to remove punctuation, non-English words, and numbers, and then divided into nGrams using the brand new (Nov 2017), aptly named R package “ngrams”. An “nGram” is a sequence of n consecutive words; the algorithm relies on the ones that occur most often. For example, a commonly used 4-word nGram is “the end of the”. Thus, if someone types “the end of”, the algorithm will predict “the” as the next word.
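The cleaning and nGram counting described above can be sketched in a few lines. This is only an illustration in Python, not the R code behind the app: the function names `clean` and `ngram_counts` and the sample sentence are made up for the example, the non-English word filter is left out because it needs a dictionary, and sentence boundaries are ignored for simplicity.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and drop digits and punctuation (hypothetical helper)."""
    text = text.lower()
    # Keep only letters, apostrophes, and spaces; everything else becomes a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up sample text standing in for the Twitter, blog, and news corpus.
sample = "At the end of the day, 2 teams met. It was the end of the season!"
tokens = clean(sample)
four_grams = ngram_counts(tokens, 4)

# The most frequent 4-word nGram in the sample.
top = four_grams.most_common(1)
print(top)
```

In this small sample the sequence “the end of the” occurs twice, so it ranks first among the 4-word nGrams.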

My algorithm creates 2-, 3-, and 4-word nGrams, ranked by how often they occur. It then searches them in the following order of priority.

  • If a 4-word nGram matches the user's text, it is used.
  • If no 4-word nGram matches, the 3-word nGrams are searched and a match is used if available.
  • Finally, if no 3-word nGram matches, the 2-word nGrams are searched and a match is used if available.
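The order of priority above amounts to a simple fallback from longer to shorter nGrams. A minimal sketch, assuming each nGram table has already been reduced to a lookup from the leading words to the most frequent next word; the name `predict_next`, the table layout, and the sample entries are all hypothetical, and Python's `None` stands in for the “NA” the app displays.

```python
def predict_next(words, tables):
    """Return the predicted next word, or None when no nGram matches.

    tables maps n -> {tuple of the first n-1 words: most frequent next word}.
    """
    for n in (4, 3, 2):                      # try the longest nGram first
        prefix = tuple(words[-(n - 1):])     # last n-1 words typed so far
        if len(prefix) == n - 1 and prefix in tables[n]:
            return tables[n][prefix]
    return None                              # no match at any length

# Made-up lookup tables for illustration only.
tables = {
    4: {("the", "end", "of"): "the"},
    3: {("one", "of"): "the"},
    2: {("thank",): "you"},
}

print(predict_next(["at", "the", "end", "of"], tables))  # matched by a 4-word nGram
print(predict_next(["is", "one", "of"], tables))         # falls back to a 3-word nGram
print(predict_next(["thank"], tables))                   # falls back to a 2-word nGram
print(predict_next(["zebra"], tables))                   # no match
```

Searching the longest nGram first means the prediction uses as much of the preceding context as the corpus can support.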

How to use the app

Start typing a sentence in the text box “Enter text here”.

As you type, the next predicted word will appear under “Sequential predicted word”. There are no buttons to press; the prediction occurs automatically.

If my algorithm can't find a match, then “NA” will appear as the next predicted word.

Use the App

My text prediction algorithm is available at the link below:

https://philipmayfield.shinyapps.io/TextPredict/

Please give the application a few seconds to load. As you type, the app takes the words you have entered, isolates the last four, and sends them to the predictive algorithm.
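Isolating the last four words of the input could look something like the sketch below. It is an illustration in Python, not the app's R code; the helper name `last_words` and the tokenizing rule (letters and apostrophes only, lowercased) are assumptions.

```python
import re

def last_words(text, k=4):
    """Return up to the last k words of the typed text (hypothetical helper)."""
    # Assumed tokenizing rule: lowercase, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return words[-k:]

print(last_words("I will see you at the end of"))
print(last_words("Hello"))
```

When fewer than four words have been typed, the helper simply returns what is there, so shorter nGrams can still be searched.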