Introduction

This report presents a high-level exploration of the three datasets. It includes line counts and word counts for each dataset, as well as the top ten bigrams for each dataset.

Line counts…

In blogs.txt, there are 899288 lines/entries.
In news.txt, there are 77259 lines/entries.
In twitter.txt, there are 2360148 lines/entries.

Word counts…

In blogs.txt, the most words in a single line/entry is 483415.
In news.txt, the most words in a single line/entry is 14556.
In twitter.txt, the most words in a single line/entry is 1484357.
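
The report does not say how the line and word figures above were computed. As a minimal sketch in Python, assuming that a "word" is any whitespace-delimited token and that each line is one entry, the two statistics could be gathered in a single pass. The function name `line_and_word_stats` is hypothetical and chosen only for illustration.

```python
def line_and_word_stats(lines):
    """Return (number of lines, largest per-line word count).

    Assumption: words are whitespace-delimited tokens; the report's own
    definition of a word is not stated.
    """
    n_lines = 0
    max_words = 0
    for line in lines:
        n_lines += 1
        # str.split() with no arguments splits on any run of whitespace
        max_words = max(max_words, len(line.split()))
    return n_lines, max_words


# Small in-memory example; a real run would pass an open file handle instead.
example = ["the quick brown fox", "hello world", ""]
stats = line_and_word_stats(example)
```

Because the function accepts any iterable of strings, an open file object (for example one opened with `encoding="utf-8"`) can be passed directly, so a large file never has to be loaded into memory at once.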

Top bigrams by source
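
One way the top ten bigrams for a source could be tallied is sketched below in Python. It assumes a simple tokenizer (lowercase runs of letters and apostrophes) and counts bigrams within each line only. Both choices are assumptions for illustration, not the method used for this report.

```python
import re
from collections import Counter


def top_bigrams(lines, n=10):
    """Return the n most frequent (word, next_word) pairs with their counts."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizer: lowercase letters and apostrophes only
        tokens = re.findall(r"[a-z']+", line.lower())
        # Pair each token with its successor; bigrams never span two lines
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


sample = ["of the day", "of the night", "of the"]
best = top_bigrams(sample, n=1)
```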

Plans for modeling and app

I will combine the three datasets into a single randomized corpus on which to build the predictive text model. I will clean the data thoroughly, with the goal of keeping only English words, free of punctuation and profanity.
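
The cleaning step could look roughly like the Python sketch below. It approximates "English words only" by keeping ASCII letters and apostrophes, which is a simplifying assumption, and the `PROFANITY` set is a hypothetical placeholder standing in for whatever word list is eventually chosen.

```python
import re

# Hypothetical placeholder; a real list would be loaded from a file.
PROFANITY = {"darn", "heck"}


def clean_line(line, banned=PROFANITY):
    """Lowercase a line, drop punctuation/digits/non-ASCII, remove banned words."""
    text = line.lower()
    # Replace everything except a-z, whitespace, and apostrophes with a space
    text = re.sub(r"[^a-z\s']", " ", text)
    # Trim stray apostrophes left at token edges (e.g. from quoted words)
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned]


cleaned = clean_line("Well, DARN it! I can't go.")
```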
I will work with a relatively small random subset of the data, since the goal of predictive text is to find the most likely next word across a broad range of inputs, not necessarily a “contextually aware” next word. I will then research candidate models and choose the best one to implement the predictive algorithm.
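
Drawing the random subset could be sketched as follows in Python. The sampling fraction and seed are illustrative assumptions; the report does not specify either. Each line is kept independently with the given probability, so the result is reproducible for a fixed seed and the full corpus never needs to be held in memory.

```python
import random


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # local generator, so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus, fraction=0.05, seed=42)
```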
To build the Shiny app, I plan a fairly basic UI in which the user starts typing and can either select from the list of predicted words or continue typing. The model will rerun and provide a new suggestion every time the user enters a space character in their string of text.
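
The trigger-on-space behavior can be illustrated independently of Shiny. The Python sketch below uses a plain bigram frequency table as a stand-in model, since the final model has not yet been chosen. The function names `build_bigram_table` and `suggest_on_space` are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_table(token_lists):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in token_lists:
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table


def suggest_on_space(text, table, k=3):
    """Return up to k next-word suggestions, only if text ends with a space."""
    if not text.endswith(" "):
        return []  # user is still mid-word, so do not rerun the model
    words = text.lower().split()
    if not words:
        return []
    # .get avoids inserting empty entries for unseen words
    followers = table.get(words[-1], Counter())
    return [word for word, _ in followers.most_common(k)]


table = build_bigram_table([["i", "love", "you"],
                            ["i", "love", "pizza"],
                            ["i", "love", "you"]])
suggestions = suggest_on_space("I love ", table)
```

In the app itself, the same check would sit inside whatever handler reacts to changes in the text input, so suggestions refresh only at word boundaries.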