Context

The data explored on this page will be used to build a predictive next-word model and incorporate it into a Shiny application. Documented below are the main features of the data and an outline of the approach for the application.

The Corpus

The corpus of three files is provided by Coursera in the context of the Johns Hopkins Data Science Specialization Capstone project. Each file is a sample from a different source: Twitter, blogs, and news articles. The table below gives some basic descriptors of each file. Note that no preprocessing has been performed at this stage.
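The descriptors in the table can be reproduced with base R along the following lines. This is a minimal sketch: the file names are assumptions based on the standard Coursera download, and the word count is a rough whitespace split.

```r
# Hypothetical file names from the standard Coursera corpus download
files <- c(Twitter = "en_US.twitter.txt",
           Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt")

describe <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size_mb = round(file.size(path) / 1024^2, 1),    # file size in MB
    lines   = length(lines),                         # number of lines
    words   = sum(lengths(strsplit(lines, "\\s+")))) # rough word count
}

t(sapply(files, describe))  # one row of descriptors per source
```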

Sampling

Extracting a sample is necessary given the size of the corpus, the computing power available on my development PC, and the desire to create a prediction application that will ultimately run on a mobile device. After some experimentation, I extracted a randomly selected 5% sample of lines from each source and combined them into a single file; that sample is used to create the summaries below. For scale, there are 171,476 words in current use in the Oxford English Dictionary, while the 5% sample contains about 145,000 terms, although not all of them are proper words. As part of the next steps for this project, I will validate whether the sample size is sufficient to create a viable application.
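A sketch of the sampling step, reusing the hypothetical `files` vector from above. The seed and the per-line coin flip are assumptions; the report states only that 5% of lines were randomly selected, not the exact mechanism.

```r
set.seed(1234)  # assumed seed, purely for reproducibility

sample_lines <- function(path, fraction = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # keep each line independently with probability `fraction`
  lines[rbinom(length(lines), 1, fraction) == 1]
}

sampled <- unlist(lapply(files, sample_lines))
writeLines(sampled, "sample_combined.txt")  # hypothetical output file
```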

Word Frequencies

The following 1-, 2-, and 3-gram frequencies were obtained from the sample after preprocessing (conversion to lowercase; removal of punctuation, numbers, stop words, and banned words).
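One way to produce these frequency tables is with the quanteda package, sketched below. The choice of quanteda is an assumption about tooling (the same pipeline could be built with tm or tidytext), and `banned_words` stands in for whatever profanity list is used.

```r
library(quanteda)

txt  <- readLines("sample_combined.txt", encoding = "UTF-8", skipNul = TRUE)

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
# toks <- tokens_remove(toks, banned_words)  # hypothetical banned-word list

# Top 10 features for 1-, 2-, and 3-grams
for (n in 1:3) {
  print(topfeatures(dfm(tokens_ngrams(toks, n = n)), 10))
}
```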

Next Steps To Build the Word Prediction Application

  1. Build a next-word prediction model based on n-grams (a sketch follows this list)
  2. Refine the model to handle unseen patterns (e.g. back off to shorter n-grams)
  3. Evaluate model accuracy and refine (e.g. change the sample size)
  4. Create the Shiny application user interface and connect it to the word prediction model
  5. Tune for performance along the way
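As a concrete starting point for steps 1 and 2, below is a minimal sketch of next-word lookup with a simple "stupid backoff". `ngram_counts` is a hypothetical list of data frames (unigrams, bigrams, trigrams) with columns `prefix`, `word`, and `count`, built from the frequency tables above; none of these names come from the report itself.

```r
predict_next <- function(input, ngram_counts, k = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  # try the trigram table first, then back off to bigrams, then unigrams
  for (n in min(length(words), 2):0) {
    prefix <- paste(tail(words, n), collapse = " ")  # last n words ("" for unigrams)
    tbl  <- ngram_counts[[n + 1]]
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]  # most frequent continuations first
      return(head(hits$word, k))
    }
  }
  character(0)  # pattern unseen at every n-gram order
}

# e.g. predict_next("thanks for the", ngram_counts)
```

Stupid backoff is only one option; Katz back-off or Kneser-Ney smoothing would handle unseen patterns more gracefully at the cost of a more complex model.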