The data explored on this page will be used to build a predictive word search and incorporate it into a Shiny Application. Documented below are features of the data and an outline of the approach for the application.
The corpus of three files is provided by Coursera as part of the Johns Hopkins Data Science Specialization Capstone project. Each file is a sample from a different source: Twitter, blogs, and news articles. The table below gives some basic descriptors of each file. Note that no preprocessing has been performed at this stage.
Extracting a sample is necessary given the size of the corpus, the computing power available on my development PC, and the goal of building a prediction application that will ultimately run on a mobile device. After some experimentation, I extracted a randomly selected 5% sample of lines from each source and combined them into a single file. That sample is used to create the summaries below. For comparison, the Oxford English Dictionary lists 171,476 words in current use. The 5% sample contains about 145,000 distinct terms; however, not all of them are proper words. As part of the next steps for this project, I will validate whether the sample size is sufficient to create a viable application.
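The sampling step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code used to produce this report; the function names `sample_lines` and `build_combined_sample`, the seed value, and the choice to sample without replacement are all assumptions.

```python
import random


def sample_lines(lines, fraction=0.05, seed=1234):
    """Return a random sample of roughly `fraction` of the lines, without replacement.

    The seed is an arbitrary, hypothetical value chosen so the sample is reproducible.
    """
    rng = random.Random(seed)
    k = round(len(lines) * fraction)
    return rng.sample(lines, k)


def build_combined_sample(sources, fraction=0.05, seed=1234):
    """Sample each source separately, then combine the results into one list.

    `sources` maps a source name (for example "twitter") to its list of lines.
    """
    combined = []
    for name in sorted(sources):  # sorted so the output order is deterministic
        combined.extend(sample_lines(sources[name], fraction, seed))
    return combined
```

Sampling each source separately, rather than sampling the concatenated corpus, keeps every source represented at the same rate regardless of how many lines each file holds.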
The following 1- to 3-gram frequencies were obtained from the sample after preprocessing: converting to lowercase and removing punctuation, numbers, stop words, and banned words.
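As a rough illustration of how such frequencies can be computed, the sketch below applies the listed preprocessing steps and then counts n-grams. It is in Python for illustration only and is not the code behind the figures in this report. The `STOP_WORDS` and `BANNED_WORDS` sets are tiny hypothetical placeholders, and the regular-expression tokenizer is an assumption that also discards non-ASCII letters.

```python
import re
from collections import Counter

# Hypothetical placeholder lists; a real run would use much fuller lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
BANNED_WORDS = {"darn"}


def preprocess(line):
    """Lowercase a line, strip punctuation and numbers, drop stop and banned words."""
    text = line.lower()
    text = re.sub(r"['’]", "", text)       # "don't" becomes "dont"
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, other symbols
    return [t for t in text.split()
            if t not in STOP_WORDS and t not in BANNED_WORDS]


def ngram_counts(lines, n):
    """Count n-grams of length `n` within each preprocessed line."""
    counts = Counter()
    for line in lines:
        tokens = preprocess(line)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(lines, 2).most_common(10)` would list the ten most frequent bigrams. Because stop words are removed before the n-grams are formed, a bigram or trigram may join words that were not adjacent in the original text.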