As required for the first part of the assignment, this report presents an exploratory analysis in preparation for building a text prediction algorithm. Composite datasets were provided from news articles, Twitter, and blogs. The data will be used to train an algorithm that will power a Shiny app.
A basic summary of the datasets is displayed below:
##           Lines LinesNEmpty     Chars CharsNWhite WordCount WordAverage
## blogs    899288      899288 206824382   170389539  37546239    41.75107
## news    1010242     1010242 203223154   169860866  34762395    34.40997
## twitter 2360148     2360148 162096241   134082806  30093413    12.75065
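The code that produced this table is not shown in the report, so the following is only a minimal Python sketch of how such a summary could be computed. The column meanings are inferred from their names and from the numbers (for blogs, 37546239 words over 899288 lines gives about 41.75 words per line); the function name `summarize` and the demo input are hypothetical.

```python
def summarize(lines):
    """Compute the six summary columns for an iterable of text lines.

    Assumed column meanings: LinesNEmpty = non-empty lines,
    CharsNWhite = non-whitespace characters, WordAverage = words per line.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    non_empty = [ln for ln in lines if ln.strip()]
    chars = sum(len(ln) for ln in lines)
    # Joining the whitespace-split pieces drops every whitespace character.
    chars_non_white = sum(len("".join(ln.split())) for ln in lines)
    words = sum(len(ln.split()) for ln in lines)
    return {
        "Lines": len(lines),
        "LinesNEmpty": len(non_empty),
        "Chars": chars,
        "CharsNWhite": chars_non_white,
        "WordCount": words,
        "WordAverage": words / len(lines) if lines else 0.0,
    }

# Hypothetical three-line input standing in for one of the corpora.
demo = ["the quick brown fox", "", "hello world"]
stats = summarize(demo)
```

Splitting on whitespace is a rough word count; a real tokenizer would likely give slightly different totals.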
A random sample was taken from each of the three datasets to illustrate the major features of the data relevant to text prediction. The sample data was then cleaned for easier processing. In the illustration below, the most common words appear more prominently; the graphic is limited to 150 words.
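The sampling and cleaning steps are not spelled out in the report. As a rough Python sketch, assuming the cleaning consisted of lowercasing and dropping digits and punctuation (an assumption, as are the helper names `sample_lines`, `clean`, and `top_words`), the word frequencies behind such a graphic could be gathered like this:

```python
import random
import re
from collections import Counter

def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]

def clean(text):
    """Lowercase and keep only letters and apostrophes (assumed cleaning rules)."""
    text = re.sub(r"[^a-z'\s]", " ", text.lower())
    return text.split()

def top_words(lines, n=150):
    """Return the n most frequent words with their counts."""
    counts = Counter()
    for ln in lines:
        counts.update(clean(ln))
    return counts.most_common(n)

# Hypothetical input; the real corpora are far larger.
demo = ["The cat sat.", "The dog sat!", "A cat ran 2 miles"]
top = top_words(demo, n=3)
```

The limit of 150 matches the cap used for the graphic; the sampling fraction is left as a parameter because the report does not state it.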
N-gram tokenization was used to see which groups of words appear most frequently. The top fifty 2-grams and 3-grams are depicted in the graphs below.
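To make the idea concrete, here is a small Python sketch of n-gram counting over already-tokenized text. It is an illustration only, not the tokenizer used for the graphs; the names `ngrams` and `top_ngrams` and the demo sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(sentences, n, k):
    """Count n-grams across tokenized sentences and return the k most common."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Hypothetical tokenized sentences.
demo = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
]
top2 = top_ngrams(demo, 2, 2)
top3 = top_ngrams(demo, 3, 1)
```

Calling `top_ngrams(sentences, 2, 50)` and `top_ngrams(sentences, 3, 50)` on the cleaned sample would give the kind of top-fifty lists plotted here.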
Beyond 3-grams, the usefulness of the analysis decreases. The graph below displays the top twelve 4-grams.
The graph below shows the top eight 5-grams, which appear even less useful.
Using this information, a predictive text application will be built on 2-gram and 3-gram models, with punctuation used to improve tokenization for more accurate results.
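The report does not say how the 2-gram and 3-gram models will be combined. One common option is a simple backoff: try the 3-gram model first and fall back to the 2-gram model when the last two words were never seen. The Python sketch below illustrates that idea under two assumptions: that "utilizing punctuation" means splitting on sentence-ending punctuation so n-grams never span a sentence boundary, and that the most frequent continuation is the prediction. All function names and the training text are hypothetical.

```python
import re
from collections import Counter, defaultdict

def sentences(text):
    """Split on sentence punctuation so n-grams do not cross sentences."""
    for part in re.split(r"[.!?;:]+", text.lower()):
        tokens = re.findall(r"[a-z']+", part)
        if tokens:
            yield tokens

def train(text):
    """Build next-word counts keyed by the previous one and two words."""
    bigram = defaultdict(Counter)
    trigram = defaultdict(Counter)
    for tokens in sentences(text):
        for i in range(len(tokens) - 1):
            bigram[tokens[i]][tokens[i + 1]] += 1
        for i in range(len(tokens) - 2):
            trigram[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return bigram, trigram

def predict(model, phrase):
    """Predict the next word: 3-gram model first, then back off to 2-gram."""
    bigram, trigram = model
    tokens = re.findall(r"[a-z']+", phrase.lower())
    if len(tokens) >= 2 and tuple(tokens[-2:]) in trigram:
        return trigram[tuple(tokens[-2:])].most_common(1)[0][0]
    if tokens and tokens[-1] in bigram:
        return bigram[tokens[-1]].most_common(1)[0][0]
    return None  # no evidence for this context

# Hypothetical training text.
model = train("I love you. I love you. I love cake! You are kind.")
```

With this toy model, `predict(model, "i love")` uses the 3-gram counts, while `predict(model, "they love")` backs off to the 2-gram counts because "they love" never occurs in the training text.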