The goal of the project is to build a predictive text model based on natural language processing techniques. The model will then be incorporated into a Shiny App to provide a user interface.
Building such a model requires a dataset from which to discover how words and sentences are put together. The data for this project have been downloaded from here.
The dataset contains sentences scraped from news reports, blogs and Twitter, split across three separate files, and is available for several languages. In this project, we will be using the English language dataset.
A table summarising the basic features of the three datasets is given below: the size of each file in megabytes, the number of sentences, and the number of words.
##      file size (MB) sentences    words
## 1    News  196.2775   1010242 34372530
## 2   Blogs  200.4242   2360148 30373543
## 3 Twitter  159.3641    899288 37334131
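For reference, a minimal sketch of how such a summary could be computed in R; the file names are assumptions, since the report does not list them.

```r
# Assumed file names for the three English datasets
files <- c(News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(`size (MB)` = file.size(path) / 1024^2,   # file size in megabytes
    sentences   = length(lines),              # one sentence per line
    words       = sum(lengths(strsplit(lines, "\\s+"))))
}

t(sapply(files, summarise_file))
```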
Since each dataset contains over 30 million words, we will create and use a subsample for the analyses that follow.
We use the sample function to draw a subsample of 10,000 lines from each dataset, and we combine these into a single dataset of 30,000 lines.
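A sketch of this subsampling step, assuming the three file names above and a fixed seed for reproducibility:

```r
set.seed(1234)  # fixed seed so the subsample is reproducible

sample_lines <- function(path, n = 10000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

news_sample    <- sample_lines("en_US.news.txt")
blogs_sample   <- sample_lines("en_US.blogs.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")

# combined 30,000-line sample used for all further analysis
combined_sample <- c(news_sample, blogs_sample, twitter_sample)
```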
##   sentences  words
## 1     30000 882124
The data need to be cleaned before further analysis. We carry out the following steps (a code sketch follows the list):

- To help clean the data and identify words, we use the Text Mining (tm) package.
- To filter profanities, we use the Google Bad Words database.
- We consider removing stopwords, the most commonly used English words, but decide against it: we would like our algorithm to correctly predict such words too.
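The sketch below illustrates one way to express this pipeline with tm; the exact sequence of transformations and the profanity file name are assumptions, not the report's verbatim code.

```r
library(tm)

corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# profanity filter; the local file name for the Google Bad Words list is assumed
profanities <- readLines("google_bad_words.txt")
corpus <- tm_map(corpus, removeWords, profanities)
corpus <- tm_map(corpus, stripWhitespace)

# note: we deliberately do NOT call removeWords with stopwords("en"),
# so that common words remain available for prediction
```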
We will start by examining which words occur most frequently, a step technically known as unigram analysis. The figure below shows the most commonly used words and their frequencies within our sample of 30,000 sentences.
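A sketch of how the unigram frequencies behind the figure could be computed from the cleaned corpus:

```r
# term-document matrix over the cleaned corpus
tdm  <- TermDocumentMatrix(corpus)

# total count per word, sorted from most to least frequent
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq, 10)  # the ten most frequent unigrams
```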
We carry out the same analysis for bigrams and trigrams, i.e. two- and three-word combinations. The figures below show the most frequently occurring bigrams and trigrams and their frequencies.
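One common way to obtain these counts is RWeka's NGramTokenizer; this choice is an assumption, as the report does not name its tokenizer.

```r
library(RWeka)

bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))

bigram_freq  <- sort(slam::row_sums(bigram_tdm),  decreasing = TRUE)
trigram_freq <- sort(slam::row_sums(trigram_tdm), decreasing = TRUE)
head(bigram_freq, 5)  # the five most frequent bigrams
```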
We will use the n-grams identified from the text, namely the unigrams, bigrams and trigrams, to construct a predictive model that suggests the next word from the words preceding it. Finally, we will build a Shiny App to act as a user interface for this model.
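As a rough illustration of where this is heading (not the final model), the hypothetical helper below looks up the most frequent trigram starting with the user's last two words, backing off to the bigram table when no trigram matches:

```r
predict_next <- function(last_two, trigram_freq, bigram_freq) {
  # trigram_freq / bigram_freq are the sorted counts from above,
  # so the first regex match is also the most frequent one
  hit <- grep(paste0("^", last_two, " "), names(trigram_freq), value = TRUE)
  if (length(hit) > 0) return(sub(".* ", "", hit[1]))  # last word of top trigram
  last_one <- sub(".* ", "", last_two)                 # back off to bigrams
  hit <- grep(paste0("^", last_one, " "), names(bigram_freq), value = TRUE)
  if (length(hit) > 0) sub(".* ", "", hit[1]) else NA_character_
}

predict_next("one of", trigram_freq, bigram_freq)
```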