Synopsis

This report is the Milestone Report Submission for the Capstone Project of the Data Science Specialisation course offered by Johns Hopkins Bloomberg School of Public Health and Coursera. The Capstone Project involves analysing the structure of a large corpus of text documents to build a predictive text product.
This report summarises the major features of the data and outlines the plans to create a prediction algorithm and Shiny app. This report also presents the initial steps taken in analysing the data such as downloading and cleaning the data.

Getting and Cleaning Data

The data used for the Capstone Project comes from a collection of corpora in 4 different languages: English, German, Finnish and Russian. The corpora have been collected from various types of sources (newspapers, magazines, blogs and Twitter updates).
The data used for this report and the related application is the American English corpus.

First, we check the number of lines and words in each file.

The en_US.blogs.txt file contains 899288 lines and 37332736 words.
The en_US.news.txt file contains 1010242 lines and 34372530 words.
The en_US.twitter.txt file contains 2360148 lines and 30373543 words.
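
The counts above can be reproduced with a short script. The following is a minimal sketch in Python, given for illustration only and not the code used for this report. The function name count_lines_and_words is hypothetical, and words are assumed to be whitespace-delimited tokens, so totals may differ slightly from those produced by another tokeniser.

```python
def count_lines_and_words(path, encoding="utf-8"):
    """Return (lines, words) for a text file, reading one line at a time.

    Assumption: a "word" is any whitespace-delimited token.
    """
    lines = 0
    words = 0
    # errors="ignore" skips bytes that cannot be decoded instead of failing
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
    return lines, words
```

Calling count_lines_and_words("en_US.blogs.txt") would return the pair of counts for the blogs file, assuming the file sits in the working directory.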

Because the files are large, we work with a random sample of the text drawn from all three files rather than the full corpus.
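
One simple way to draw such a sample is to keep each line independently with a fixed probability. The sketch below is illustrative only: the function name sample_lines, the fraction argument and the seed value are assumptions, since the report does not state the sample size or the sampling method used.

```python
import random


def sample_lines(lines, fraction, seed=1234):
    """Keep each line independently with probability `fraction`.

    `fraction` and `seed` are illustrative parameters, not values
    taken from the report.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Fixing the seed means that repeated runs produce the same sample, which keeps the exploratory figures stable between runs.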

We then clean the data to remove raw text features that could cause problems when performing text mining tasks. We convert all text to lower case and remove punctuation, numbers, extra white space and profanity.

For this project, the profanity word list has been sourced from the following URL: https://github.com/quellhorst/negative-keywords/blob/master/profanity.txt

Stopwords have not been removed at this stage, as they are deemed useful for predicting the words that follow an input text.
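
The cleaning steps described above can be sketched as follows. This is an illustrative Python version, not the report's own code. The function name clean_text is hypothetical, the profanity set is assumed to have been loaded separately from the list linked above, and only ASCII punctuation is handled.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
# Non-ASCII punctuation (e.g. curly quotes) is left untouched in this sketch.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, profanity=frozenset()):
    """Lower-case, strip punctuation and numbers, drop profanity, tidy spaces."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    text = re.sub(r"\d+", " ", text)      # remove numbers
    # Stopwords are deliberately kept; only listed profanity is dropped.
    tokens = [token for token in text.split() if token not in profanity]
    return " ".join(tokens)               # split/join collapses extra white space
```

For example, clean_text("Hello,   World! 123 darn it", {"darn"}) returns "hello world it". Note that removing punctuation also merges contractions, so "don't" becomes "dont".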

Exploratory Data Analysis

This section summarises the main features of the sample.

Using a word cloud we take a quick look at the most frequent words in the sample.

We then use n-gram models to identify the 20 terms most frequently found in the documents.

We use bi-grams and tri-grams to get a better understanding of words commonly used together.

The top 20 bi-grams represent word groups of length two that occur together in order.

The top 20 tri-grams represent word groups of length three that occur together in order.
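
Counting n-grams amounts to sliding a window of length n over the tokens of each line and tallying the results. The sketch below is illustrative only. The names ngrams and top_ngrams are hypothetical, and the input is assumed to be the cleaned sample, one document per line.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=20):
    """Return the k most frequent n-grams with their counts."""
    counts = Counter()
    for line in lines:
        # Counting line by line means n-grams never span two documents.
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

With n set to 1, 2 or 3 this gives the top single terms, bi-grams and tri-grams respectively.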

Text Prediction

This section outlines the basic features and functionalities of the Shiny app.

The User Interface (UI) will consist of two parts: an Input part on the left where the text is typed or pasted in by the User and an Output part on the right where the words predicted to follow the input text will be displayed.
The Input part of the app will consist of a TextInput widget where English text can be typed or pasted by the user. Below the TextInput widget, a SubmitButton widget will enable the user to send the text to the predictive algorithm.
The Output part of the app will consist of a no-choose prompt where the top-3 or top-5 words predicted to follow the input will be displayed. To find them, the algorithm will look up the last two words of the input in the tri-grams and retrieve the final words of the top-3 or top-5 matching tri-grams to send back to the UI. If no tri-gram matches, the algorithm will look up the last word of the input in the bi-grams and retrieve the final words of the top-3 or top-5 matching bi-grams to send back to the UI.
If no bi-gram matches either, smoothed probabilities will be used to estimate the most likely words to follow, with back-off models used to estimate the probability of unobserved n-grams.
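
The lookup order described in this section (tri-grams first, then bi-grams, then the most frequent single words) can be sketched as below. This is a simplified illustration in Python, not the planned implementation. The names build_model and predict_next are hypothetical, candidates are ranked by raw counts, and no smoothing or discounting is applied, since the report does not yet specify which scheme will be used.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context (a tuple of 0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                # The last word of the n-gram is the target, the rest is context.
                *context, target = tokens[i:i + n]
                model[tuple(context)][target] += 1
    return model


def predict_next(model, text, k=3, max_n=3):
    """Return up to k candidate next words, backing off to shorter contexts."""
    tokens = text.lower().split()
    # Longest context first: last 2 words, then last 1, then none (unigrams).
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        candidates = model.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []
```

Given a model built from the lines "i love new york" and "i love new ideas", predict_next(model, "they love new", k=2) returns both "york" and "ideas", because the tri-gram context ("love", "new") is found directly. The empty-tuple context holds plain word counts, so an input with no matching context falls through to the most frequent words overall.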