Overview

As required for the first part of the assignment, this report presents an exploratory analysis in preparation for building a text prediction algorithm. Composite datasets were provided from news articles, Twitter, and blogs. The data will be used to train an algorithm that will power a Shiny app.

Data Summary

A basic summary of the datasets is displayed below:

##           Lines LinesNEmpty     Chars CharsNWhite WordCount WordAverage
## blogs    899288      899288 206824382   170389539  37546239    41.75107
## news    1010242     1010242 203223154   169860866  34762395    34.40997
## twitter 2360148     2360148 162096241   134082806  30093413    12.75065
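
For illustration, statistics of this kind can be computed along the following lines. This is a minimal Python sketch, not the code that produced the table above. It assumes that `LinesNEmpty` counts non-empty lines, `CharsNWhite` counts non-whitespace characters, and `WordAverage` is the word count divided by the number of lines (which is consistent with the figures shown). The function name `summarize` is hypothetical.

```python
def summarize(lines):
    """Return summary statistics for a list of text lines (assumed column meanings)."""
    non_empty = [ln for ln in lines if ln.strip()]
    chars = sum(len(ln) for ln in lines)
    # Count every character that is not whitespace.
    chars_non_white = sum(1 for ln in lines for ch in ln if not ch.isspace())
    # Words are approximated as whitespace-separated tokens.
    word_count = sum(len(ln.split()) for ln in lines)
    return {
        "Lines": len(lines),
        "LinesNEmpty": len(non_empty),
        "Chars": chars,
        "CharsNWhite": chars_non_white,
        "WordCount": word_count,
        "WordAverage": word_count / len(lines) if lines else 0.0,
    }


stats = summarize(["the quick brown fox", "", "hello world"])
print(stats)
```

Splitting on whitespace is only an approximation of a word count; a different tokenizer would give slightly different totals.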

Data Visualization

A random sample was taken from each of the three datasets to illustrate the major features of the data relevant to text prediction. The sample data was then cleaned for easier processing. In the illustration below, more common words appear more prominently; the graphic is limited to 150 words.
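
The sampling, cleaning, and word-counting steps described above might look roughly like the following Python sketch. The sampling fraction, the random seed, and the cleaning rules (lowercasing, keeping only letters and apostrophes) are assumptions made for illustration; the report does not state which rules were actually applied. The helper names `sample_lines`, `clean`, and `top_words` are hypothetical.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Draw a reproducible random sample of lines (fraction and seed are assumed)."""
    rng = random.Random(seed)
    k = min(len(lines), max(1, int(len(lines) * fraction)))
    return rng.sample(lines, k)


def clean(line):
    """Lowercase a line and keep only letters and apostrophes (assumed rules)."""
    line = re.sub(r"[^a-z' ]+", " ", line.lower())
    return line.split()


def top_words(lines, n=150):
    """Return the n most frequent words with their counts."""
    counts = Counter()
    for ln in lines:
        counts.update(clean(ln))
    return counts.most_common(n)


corpus = ["The cat sat.", "The dog barked!", "A cat purred."]
print(top_words(sample_lines(corpus, 1.0), n=2))
```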

N-gram tokenization was used to see which groups of words appear most frequently. The top fifty 2-grams and 3-grams are depicted in the graphs below.
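
As a rough illustration of N-gram tokenization, the Python sketch below slides a window of n words over each tokenized line and counts the resulting tuples. It assumes the text has already been split into word tokens; the function names `ngrams` and `top_ngrams` are hypothetical and not taken from the report.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, n, k):
    """Count n-grams across many token lists and return the k most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


lines = [
    "thanks for the follow".split(),
    "thanks for the help".split(),
]
print(top_ngrams(lines, 2, 3))
print(top_ngrams(lines, 3, 1))
```

Counting per line keeps N-grams from spanning two unrelated lines. A line with fewer than n tokens simply contributes nothing.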

Beyond 3-grams, the usefulness of the analysis decreases. The graph below displays the top twelve 4-grams.

The graph below shows the top eight 5-grams, which appear less useful still.

Plans

Using this information, a predictive text application will be built on 2-gram and 3-gram models. Punctuation will be used to improve tokenization for more accurate results.
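
One way such a model could work is sketched below in Python. This is an assumption-laden illustration, not the planned implementation: it tries the 3-gram table first and falls back to the 2-gram table when the last two words were never seen together. It also reads "utilize punctuation" as splitting text at sentence-ending punctuation so that no N-gram spans two sentences, which is only one possible interpretation. The class `BackoffModel` and the helpers `sentences` and `tokenize` are hypothetical names.

```python
import re
from collections import Counter, defaultdict


def sentences(text):
    """Split text at sentence-like punctuation (an assumed use of punctuation)."""
    return [s for s in re.split(r"[.!?;:]+", text.lower()) if s.strip()]


def tokenize(text):
    """Extract lowercase word tokens made of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class BackoffModel:
    def __init__(self):
        self.bigrams = defaultdict(Counter)   # word -> next-word counts
        self.trigrams = defaultdict(Counter)  # (w1, w2) -> next-word counts

    def train(self, text):
        for sentence in sentences(text):
            toks = tokenize(sentence)
            for i in range(len(toks) - 1):
                self.bigrams[toks[i]][toks[i + 1]] += 1
            for i in range(len(toks) - 2):
                self.trigrams[(toks[i], toks[i + 1])][toks[i + 2]] += 1

    def predict(self, text, k=3):
        """Suggest up to k next words: 3-gram context first, then 2-gram."""
        toks = tokenize(text)
        if len(toks) >= 2 and tuple(toks[-2:]) in self.trigrams:
            context = self.trigrams[tuple(toks[-2:])]
            return [word for word, _ in context.most_common(k)]
        if toks and toks[-1] in self.bigrams:
            return [word for word, _ in self.bigrams[toks[-1]].most_common(k)]
        return []


model = BackoffModel()
model.train("Thanks for the follow. Thanks for the help! Thanks for the follow.")
print(model.predict("thanks for the"))
print(model.predict("hello the", k=1))
```

This simple fallback ignores probabilities entirely and returns raw most-frequent continuations; a smoothing scheme would be a natural refinement once the basic app works.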