Cleaning and Pre-Processing the Data

I loaded the appropriate packages and computed summary statistics for each text document: the number of lines, the word count, the number of blank spaces, the amount of punctuation, the number of non-English words, and the number of numeric tokens (sketched below).

## Warning in readLines(newsfile, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
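
The counts summarized in the next section were produced along these lines. This is a minimal sketch using `readLines()` and the `stringi` package; the `count_stats()` helper and the regular-expression heuristics for blank space, non-English words, and numbers are illustrative assumptions rather than the original code (only `en_US.news.txt` is confirmed by the warning above; the other file names follow the standard naming of this dataset).

```r
library(stringi)

# Read each corpus file; skipNul avoids errors from embedded NUL characters
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Illustrative helper: summary counts for one vector of text lines
count_stats <- function(lines) {
  c(rows        = length(lines),
    words       = sum(stri_count_words(lines)),
    blankspace  = sum(stri_count_regex(lines, "\\s")),            # whitespace characters
    punctuation = sum(stri_count_regex(lines, "[[:punct:]]")),    # punctuation characters
    nonenglish  = sum(stri_count_regex(lines, "[^\\x20-\\x7E]")), # rough proxy: non-ASCII characters
    numbers     = sum(stri_count_regex(lines, "[0-9]+")))         # runs of digits
}

sapply(list(Blog = blogs, News = news, Twitter = twitter), count_stats)
```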

Understanding the Data

As the summary below shows, these three files contain far more data than I can realistically use in full for this analysis, so I will take an unbiased random sample that represents the data as a whole (a sampling sketch follows the summary table).

##                   Blog Info News Info Twitter Info
## Number of Rows       899288     77259      2360148
## Word Count         37546246   2674536     30093410
## Blankspace Count   36434843   2566710     28013435
## Punctuation Count   6536746    533196      7877048
## NonEnglish Count     716174     22587       114774
## Number Count         411373     64181       505709
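
A simple way to draw such an unbiased sample is to flip a weighted coin for each line. The sketch below assumes a 1% sampling rate; the report does not state the exact fraction that was used.

```r
set.seed(1234)  # for reproducibility

sample_rate <- 0.01  # assumed rate for illustration

# Keep each line with probability sample_rate
blogs_sample   <- blogs[rbinom(length(blogs),     1, sample_rate) == 1]
news_sample    <- news[rbinom(length(news),       1, sample_rate) == 1]
twitter_sample <- twitter[rbinom(length(twitter), 1, sample_rate) == 1]

sample_text <- c(blogs_sample, news_sample, twitter_sample)
length(sample_text)
```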

Processing the Data for Analysis

I then created the corpus and applied the transformations shown below. The next step was to tokenize the corpus and create a term-document matrix. Finally, I filtered the results to keep unigrams with a word frequency greater than 200, bigrams with a frequency greater than 50, and trigrams with a frequency greater than 50 (a processing sketch follows the output below).

## [1] 206824505  15639408 162096241
## Warning in nr * nc: NAs produced by integer overflow

## Warning in nr * nc: NAs produced by integer overflow

## Warning in nr * nc: NAs produced by integer overflow
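
The processing sketched below uses `tm` for the corpus and `RWeka` for the n-gram tokenizers; the specific transformations and package choices are assumptions based on common practice, not a copy of the original chunk. The `NAs produced by integer overflow` warnings above most likely come from converting very large term-document matrices to dense matrices; the sketch avoids that by subsetting to the frequent terms before calling `as.matrix()`.

```r
library(tm)
library(RWeka)

# Build and clean the corpus from the sampled text
corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# N-gram tokenizers
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrices for uni-, bi-, and trigrams
tdm_uni <- TermDocumentMatrix(corpus)
tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Keep only the terms above the frequency thresholds used in the report
unigrams <- findFreqTerms(tdm_uni, lowfreq = 200)
bigrams  <- findFreqTerms(tdm_bi,  lowfreq = 50)
trigrams <- findFreqTerms(tdm_tri, lowfreq = 50)

# Sorted frequency tables restricted to those terms
uni_freq <- sort(rowSums(as.matrix(tdm_uni[unigrams, ])), decreasing = TRUE)
bi_freq  <- sort(rowSums(as.matrix(tdm_bi[bigrams, ])),   decreasing = TRUE)
tri_freq <- sort(rowSums(as.matrix(tdm_tri[trigrams, ])), decreasing = TRUE)
```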

Visualizations of the Top 10 N-Grams

Using the resulting n-grams, I plotted the most common unigrams, bigrams, and trigrams, along with a word cloud (a plotting sketch follows the warning below).

## Warning: Removed 4 rows containing missing values (position_stack).
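
A bar chart of the ten most frequent terms and a word cloud can be produced along these lines with `ggplot2` and `wordcloud`. The sketch below shows only the unigram plot and assumes the `uni_freq` table from the previous step; the bigram and trigram plots follow the same pattern.

```r
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Top 10 unigrams as a data frame for plotting
top_uni <- data.frame(term = names(uni_freq)[1:10],
                      freq = as.numeric(uni_freq[1:10]))

ggplot(top_uni, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Unigrams", x = "Unigram", y = "Frequency")

# Word cloud of the most frequent unigrams
wordcloud(names(uni_freq), uni_freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```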

Next Steps

This concludes my exploratory analysis.

Now that I have performed some exploratory analysis and built some preliminary n-gram models, a potential strategy for the final product is to combine the n-gram models with a frequency look-up table and a back-off technique (a rough sketch follows).
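
As a rough illustration of that strategy (not yet implemented), a stupid-backoff-style lookup could work as follows: query the trigram table with the last two words of the input, fall back to the bigram table with the last word, and finally to the most frequent unigrams. The function name `predict_next()` and the assumption that the frequency tables are named numeric vectors with space-separated n-grams as names are hypothetical.

```r
# Hypothetical sketch of a frequency look-up with back-off
predict_next <- function(phrase, tri_freq, bi_freq, uni_freq, n = 3) {
  words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])

  # Try trigrams: match on the last two words of the input
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- tri_freq[startsWith(names(tri_freq), paste0(prefix, " "))]
    if (length(hits) > 0)
      return(sapply(strsplit(names(head(hits, n)), " "), tail, 1))
  }

  # Back off to bigrams: match on the last word
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits <- bi_freq[startsWith(names(bi_freq), paste0(prefix, " "))]
    if (length(hits) > 0)
      return(sapply(strsplit(names(head(hits, n)), " "), tail, 1))
  }

  # Final back-off: the most frequent unigrams overall
  names(head(uni_freq, n))
}
```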

For the user interface, I plan to create a Shiny app with a simple text-input field that displays a list of suggested “next” words based on the prediction model.
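
A minimal sketch of what such a Shiny interface might look like; the layout and the call to the hypothetical `predict_next()` function above are illustrative assumptions, not the final design.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggested next words"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    # predict_next() is the hypothetical back-off function sketched above
    predict_next(input$phrase, tri_freq, bi_freq, uni_freq)
  })
}

shinyApp(ui = ui, server = server)
```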