I loaded the appropriate packages and, for each text document, computed the number of rows, the word count, and the counts of blank spaces, punctuation marks, non-English words, and numbers.
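A minimal sketch of this step is below. The `readLines()` arguments match the warning that follows; the file names are assumed to follow the `en_US.news.txt` pattern, and the `stringr` counting rules are my own reconstruction (counting non-ASCII characters is only a rough proxy for non-English words).

```r
library(stringr)

# Read the three source files; the readLines() arguments match the
# warning emitted below for en_US.news.txt
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Per-document summary statistics (illustrative patterns; non-ASCII
# characters stand in for "non-English words")
count_stats <- function(lines) {
  c(rows        = length(lines),
    words       = sum(str_count(lines, "\\S+")),
    blankspace  = sum(str_count(lines, " ")),
    punctuation = sum(str_count(lines, "[[:punct:]]")),
    nonenglish  = sum(str_count(lines, "[^\\x{20}-\\x{7E}]")),
    numbers     = sum(str_count(lines, "[0-9]")))
}
```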
## Warning in readLines(newsfile, encoding = "UTF-8", skipNul = TRUE):
## incomplete final line found on 'en_US.news.txt'
As seen below, these three documents contain a great deal of data; far too much to use in full for my analysis. I will therefore take an unbiased random sample that represents the data as a whole (a sketch of the sampling step follows the summary table).
##                    Blog Info  News Info  Twitter Info
## Number of Rows        899288      77259       2360148
## Word Count          37546246    2674536      30093410
## Blankspace Count    36434843    2566710      28013435
## Punctuation Count    6536746     533196       7877048
## NonEnglish Count      716174      22587        114774
## Number Count          411373      64181        505709
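A sketch of the sampling approach: with a fixed seed, `rbinom()` gives every line the same independent chance of selection, which keeps the sample unbiased with respect to position in the file. The 1% sampling rate is an assumed placeholder.

```r
set.seed(1234)

# Keep each line with equal, independent probability
sample_lines <- function(lines, rate = 0.01) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)
```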
I then created the corpus, which I transformed as seen below. The next step was to tokenize the corpus and create a term-document matrix. Finally, I filtered the data to create unigrams with a word frequency greater than 200, bigrams with a two-word frequency greater than 50, and trigrams with a three-word frequency greater than 50.
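A sketch of that pipeline, assuming `tm` for the corpus transformations and RWeka's `NGramTokenizer` for tokenization; the frequency cut-offs match the text above, while the specific transformations are assumptions. Calling `as.matrix()` on a very large term-document matrix is the likely source of the integer-overflow warnings below.

```r
library(tm)
library(RWeka)

# Build and clean the corpus (assumed transformations)
corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Tokenize into n-grams, build a term-document matrix, and keep only
# terms above the given frequency cut-off
ngram_freq <- function(corpus, n, min_freq) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq[freq > min_freq]
}

unigrams <- ngram_freq(corpus, 1, 200)
bigrams  <- ngram_freq(corpus, 2, 50)
trigrams <- ngram_freq(corpus, 3, 50)
```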
## [1] 206824505 15639408 162096241
## Warning in nr * nc: NAs produced by integer overflow
## Warning in nr * nc: NAs produced by integer overflow
## Warning in nr * nc: NAs produced by integer overflow
Using the created n-grams, I plotted the most common unigrams, bigrams, and trigrams, along with a word cloud.
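The plots were produced along these lines (a sketch assuming `ggplot2` and `wordcloud`; the top-20 cut-off and styling are placeholders):

```r
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Horizontal bar chart of the 20 most frequent n-grams
plot_ngrams <- function(freq, title) {
  df <- data.frame(ngram = names(head(freq, 20)), count = head(freq, 20))
  ggplot(df, aes(x = reorder(ngram, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plot_ngrams(unigrams, "Most Common Unigrams")
plot_ngrams(bigrams,  "Most Common Bigrams")
plot_ngrams(trigrams, "Most Common Trigrams")

# Word cloud of the most frequent unigrams
wordcloud(names(unigrams), unigrams, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```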
## Warning: Removed 4 rows containing missing values (position_stack).
This concludes my exploratory analysis.
Having built these preliminary n-gram models, a potential strategy for the final product is an n-gram model with a frequency look-up table combined with a back-off technique: when no trigram matches the user's input, the model backs off to bigrams, and then to the most frequent unigrams.
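To make that concrete, here is a hedged sketch of the look-up-with-back-off idea; the `predict_next()` helper and its details are illustrative, not the final implementation.

```r
# Illustrative back-off look-up over the sorted frequency tables built above
predict_next <- function(phrase, trigrams, bigrams, unigrams) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 0) return(names(unigrams)[1])

  # 1. Trigram look-up on the last two words; the tables are sorted by
  #    frequency, so the first match is the most frequent completion
  if (length(words) == 2) {
    hits <- grep(paste0("^", words[1], " ", words[2], " "), names(trigrams))
    if (length(hits) > 0) return(sub(".* ", "", names(trigrams)[hits[1]]))
  }

  # 2. Back off to bigrams on the last word
  hits <- grep(paste0("^", tail(words, 1), " "), names(bigrams))
  if (length(hits) > 0) return(sub(".* ", "", names(bigrams)[hits[1]]))

  # 3. Final fall-back: the single most frequent unigram
  names(unigrams)[1]
}
```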
For the user interface, I plan to create a Shiny app with a simple text-input field that displays a list of suggested "next" words based on the prediction model.
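A minimal skeleton of that app, assuming the `predict_next()` helper sketched above; the layout and widget choices are placeholders:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(input$phrase)
    # Hook into the back-off model sketched earlier
    data.frame(Suggestion = predict_next(input$phrase,
                                         trigrams, bigrams, unigrams))
  })
}

shinyApp(ui, server)
```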