The goal of the project is to build a predictive text model based on natural language processing techniques. The model will then be incorporated into a Shiny App to provide a user interface.
Building such a model requires a dataset from which to discover how words and sentences are put together. The data for this project have been downloaded from here.
The dataset contains sentences scraped from news reports, blogs and Twitter, split across three separate files, and is available for several languages. In this project, we will be using the English language dataset.
A table summarising the basic features of the three datasets is given below: the size of each file in megabytes, the number of sentences, and the number of words.
##      file size (MB) sentences    words
## 1    News  196.2775   1010242 34372530
## 2   Blogs  200.4242   2360148 30373543
## 3 Twitter  159.3641    899288 37334131
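For reference, a minimal sketch of how such a summary could be computed in R; the file names are assumptions, since the report does not list them.

```r
# Assumed file names for the three English datasets
files <- c(News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt",
           Twitter = "en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(`size (MB)` = file.size(path) / 1024^2,   # file size in megabytes
    sentences   = length(lines),              # one sentence per line
    words       = sum(lengths(strsplit(lines, "\\s+"))))
}

t(sapply(files, summarise_file))
```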
Since each dataset contains over 30 million words, we will create and use a subsample for the analyses that follow.
We use the sample function to draw a subsample of 10,000 lines from each dataset, and we combine these into a single dataset of 30,000 lines.
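A sketch of this subsampling step, assuming the three file names above and a fixed seed for reproducibility:

```r
set.seed(1234)  # fixed seed so the subsample is reproducible

sample_lines <- function(path, n = 10000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

news_sample    <- sample_lines("en_US.news.txt")
blogs_sample   <- sample_lines("en_US.blogs.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")

# combined 30,000-line sample used for all further analysis
combined_sample <- c(news_sample, blogs_sample, twitter_sample)
```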
##   sentences  words
## 1     30000 882124
The data need to be cleaned before further analysis. We carry out the following steps (a code sketch follows the list):

- To help clean the data and identify words, we use the Text Mining (tm) package.
- To filter profanities, we use the Google Bad Words database.
- We consider removing stopwords, the most commonly used English words, but decide against it: we would like our algorithm to correctly predict such words too.
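The sketch below illustrates one way to express this pipeline with tm; the exact sequence of transformations and the profanity file name are assumptions, not the report's verbatim code.

```r
library(tm)

corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# profanity filter; the local file name for the Google Bad Words list is assumed
profanities <- readLines("google_bad_words.txt")
corpus <- tm_map(corpus, removeWords, profanities)
corpus <- tm_map(corpus, stripWhitespace)

# note: we deliberately do NOT call removeWords with stopwords("en"),
# so that common words remain available for prediction
```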
We will start by examining which words occur most frequently, a step technically known as unigram analysis. The figure below shows the most commonly used words and their frequencies within our sample of 30,000 sentences.
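A sketch of how the unigram frequencies behind the figure could be computed from the cleaned corpus:

```r
# term-document matrix over the cleaned corpus
tdm  <- TermDocumentMatrix(corpus)

# total count per word, sorted from most to least frequent
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq, 10)  # the ten most frequent unigrams
```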
We carry out the same analysis for bigrams and trigrams, i.e. two- and three-word combinations. The figures below show the most frequently occurring bigrams and trigrams and their frequencies.
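One common way to obtain these counts is RWeka's NGramTokenizer; this choice is an assumption, as the report does not name its tokenizer.

```r
library(RWeka)

bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))

bigram_freq  <- sort(slam::row_sums(bigram_tdm),  decreasing = TRUE)
trigram_freq <- sort(slam::row_sums(trigram_tdm), decreasing = TRUE)
head(bigram_freq, 5)  # the five most frequent bigrams
```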
We will use the n-grams identified from the text, namely the unigrams, bigrams and trigrams, to construct a predictive model that suggests the next word from the words preceding it. Finally, we will build a Shiny App to act as a user interface for this model.
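As a rough illustration of where this is heading (not the final model), the hypothetical helper below looks up the most frequent trigram starting with the user's last two words, backing off to the bigram table when no trigram matches:

```r
predict_next <- function(last_two, trigram_freq, bigram_freq) {
  # trigram_freq / bigram_freq are the sorted counts from above,
  # so the first regex match is also the most frequent one
  hit <- grep(paste0("^", last_two, " "), names(trigram_freq), value = TRUE)
  if (length(hit) > 0) return(sub(".* ", "", hit[1]))  # last word of top trigram
  last_one <- sub(".* ", "", last_two)                 # back off to bigrams
  hit <- grep(paste0("^", last_one, " "), names(bigram_freq), value = TRUE)
  if (length(hit) > 0) sub(".* ", "", hit[1]) else NA_character_
}

predict_next("one of", trigram_freq, bigram_freq)
```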