Milestone Report

Introduction

This is the start of the Data Science Specialisation Capstone Project to build a predictive text model. We are given three English text files (a blog text, a Twitter text and a news text) from a corpus called HC Corpora, which will form the training dataset for our predictive text model. At this stage, we hope to gain a better understanding of the text data in order to build an n-gram dictionary for the prediction model.

Exploratory (Text) Data Analysis

First, we seek to understand the distribution of, and relationships between, the words, tokens and phrases in the three text files, so as to prepare to build our first linguistic models. This includes the frequencies, and the variation in the frequencies, of words and word pairs/phrases in the text files.

Text Size
TextSource	Object_Size_in_Bytes	Line_Count	Word_Count
Filetwit	316037344	2360148	30373792
Fileblog	260564320	899288	37334441
Filenews	20111392	77259	2643972

Creating Subsets of Twitter, Blog and News Text through Random Sampling

Next, given the large text size, we create a random sample from each of the three text files, with about 200,000 words in each sample, for exploratory analysis. Altogether, the samples account for about 0.9% of the original text files in terms of word count, which should provide sufficient text to build an n-gram dictionary. The random samples are combined and loaded as a corpus for subsequent text pre-processing and cleaning.

Sample Text Size
TextSource	Object_Size_in_Bytes	Line_Count	Word_Count
twitsample	2170432	16000	206072
blogsample	1448800	5000	208047
newsample	1570176	6000	206915
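
As a rough illustration of the sampling step, the following base R sketch draws a reproducible random sample of lines and counts the words in it. The helper names `sample_lines` and `count_words`, the seed value and the toy input are assumptions made for this example; in the report the inputs would be the full Twitter, blog and news files, with 16,000, 5,000 and 6,000 lines drawn respectively (the line counts in the table above).

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                        # fixed seed so the sample can be reproduced
  n <- min(n, length(lines))            # never ask for more lines than exist
  lines[sample.int(length(lines), n)]   # sample without replacement
}

# Rough word count: split on runs of whitespace and count non-empty tokens.
count_words <- function(lines) {
  tokens <- unlist(strsplit(lines, "\\s+"))
  sum(nzchar(tokens))
}

# Toy stand-in for the lines of one text file; the real files are far larger.
toy_lines <- c("the quick brown fox", "jumps over", "the lazy dog",
               "hello world again", "one more line")

toy_sample <- sample_lines(toy_lines, n = 3)
count_words(toy_sample)
```

Sampling whole lines rather than individual words keeps word order intact within each line, which matters once n-grams are extracted.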

Pre-Processing the Random Samples of Twitter, Blog and News

Upon loading the sample data as a corpus, we start “cleaning” the text. Text transformation is performed using the tm_map() function to do the following:

Replace contractions with full words for better predictive power and to assign each n-gram to the correct dictionary. For instance, for the input text “I”, the next word could be “will” or “am” if the full words are stored; otherwise the model may predict “ll” or “m” instead.
Remove URLs and replace “/”, “@” and “|” in the text with a space
Remove special characters such as â and ê that may be found in foreign-language text (except for some special characters that are better replaced by an apostrophe, as the affected words are mainly contractions; this replacement is done before the contraction step above, so that such words are expanded to full words where they are assessed to be contractions)
Convert all text to lower case for ease of analysis
Remove numbers, punctuation (but preserving intra-word dashes) and extra white space
Remove hash tags and Twitter handles
Remove profanity (Note: The list of bad words is from Luis von Ahn’s research group at CMU, see http://www.cs.cmu.edu/~biglou/resources/)

Common English stopwords are, however, not removed, as they are plausible next words and are therefore useful in our predictive text modelling. Text stemming is also not performed, as we want to capture all forms of words rather than reduce words to their root form.
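
The cleaning steps can be sketched in base R roughly as follows. This is only an approximation under stated assumptions, not the tm_map() pipeline used for the report: the function name `clean_text`, the short contraction map, the `banned` argument and the order of the steps are all hypothetical choices made for this example.

```r
# Hypothetical contraction map; a real one would be much longer.
# Specific forms come first so that "won't" is not caught by "n't".
contractions <- c("won't" = "will not", "can't" = "cannot", "n't" = " not",
                  "'ll" = " will", "'m" = " am", "'re" = " are", "'ve" = " have")

clean_text <- function(x, banned = character(0)) {
  x <- gsub("[\u2018\u2019]", "'", x)                       # curly apostrophes -> straight
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("[#@]\\w+", " ", x, perl = TRUE)                # hash tags and handles
  for (pat in names(contractions)) {                        # expand contractions
    x <- gsub(pat, contractions[[pat]], x, fixed = TRUE)
  }
  x <- gsub("[/@|]", " ", x)                                # separators -> space
  x <- gsub("[0-9]+", " ", x)                               # numbers
  x <- gsub("'", "", x, fixed = TRUE)                       # leftover apostrophes
  x <- gsub("[^a-z -]", " ", x)                             # other punctuation and symbols
  x <- gsub("(?<![a-z])-|-(?![a-z])", " ", x, perl = TRUE)  # keep only intra-word dashes
  if (length(banned) > 0) {                                 # drop words on the banned list
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  trimws(gsub("\\s+", " ", x))                              # collapse extra white space
}

clean_text("I'll visit http://example.com @bob #fun, don't you? It's well-known -- 42 times!")
```

In this sketch hash tags and handles are removed before “@” is replaced with a space; otherwise the handle text would survive as an ordinary word.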

Building a N-gram Dictionary

Subsequently, the text is converted into a document-term matrix for further computation. This approach results in a matrix with document IDs as rows and terms as columns. The matrix elements are term frequencies. The frequencies of unigrams (n-grams of size 1), bigrams (n-grams of size 2), trigrams (n-grams of size 3) and four-grams (n-grams of size 4) are displayed in the barplots and word clouds below.
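
To make the idea of an n-gram frequency table concrete, here is a small base R sketch that counts n-grams directly from cleaned lines of text, without going through a document-term matrix. The helper name `ngram_counts` and the two toy sentences are assumptions for illustration only.

```r
# Hypothetical helper: count n-grams within each line (no n-gram spans two lines).
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " +"), function(tok) {
    tok <- tok[nzchar(tok)]                        # drop empty tokens
    if (length(tok) < n) return(character(0))      # line too short for this n
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)  # most frequent first
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

cleaned  <- c("i do not know", "i do not care")
bigrams  <- ngram_counts(cleaned, 2)
trigrams <- ngram_counts(cleaned, 3)
```

Each call returns a data frame with one row per distinct n-gram, sorted by decreasing frequency, which is the shape assumed for the dictionaries discussed in the rest of this report.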

Interesting Findings

Some interesting findings from this stage of the project are as follows:

Importance of converting special characters appropriately, as this affects the word frequencies on which the word prediction model depends. I compared the frequency of the word “not” in my sample before and after the conversion, and the difference was more than 700.
Importance of replacing contractions with full words, as this affects not only the word frequencies but also which dictionary the words fall into. For instance, the word “don’t” will be counted in the unigram dictionary if it is not converted, but as “do not” in the bigram dictionary, under “do”, if it is converted. This aids prediction when the user enters “do”, “not” or “do not”, and for the words before and after each of these words/phrases.
Importance of knowing the type of text object we are dealing with, e.g. character, VCorpus or Corpus, as certain functions only work with particular types.
The slope of the frequency distribution is steepest for unigrams (i.e. the frequency of unigram words drops rapidly) and becomes more gradual as the order of the n-gram increases. This also explains the wide frequency range of 28,551 to 1,850 for the top 20 unigram words, compared with the range of 70 to 27 for the top 20 phrases in the four-gram dictionary.
From the summary table below, it is interesting to note that although the number of unique words/phrases is not large, the occurrences of these unigrams, bigrams, trigrams and four-grams are high and represent more than 75% of the words in the random sample.

N-Gram Dictionary
Ngram_Dictionary	Unique_Words_Phrases	Freq_in_Sample	Percentage_of_Sample
Unigram	43588	469414	75.58588
Bigram	289955	584274	94.08084
Trigram	477208	557455	89.76240
Fourgram	516606	531413	85.56907
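
The Percentage_of_Sample column is consistent with dividing Freq_in_Sample by the combined word count of the three samples. The short check below uses only the figures from the tables in this report; the assumption that this combined count is the denominator is an inference from the numbers, not something stated by the original computation. The same totals also reproduce the roughly 0.9% share of the original text.

```r
# Word counts taken from the "Text Size" and "Sample Text Size" tables.
original_words <- c(Filetwit = 30373792, Fileblog = 37334441, Filenews = 2643972)
sample_words   <- c(twitsample = 206072, blogsample = 208047, newsample = 206915)

total_sample <- sum(sample_words)

# Share of the original text covered by the samples, in percent.
sample_share <- 100 * total_sample / sum(original_words)

# Freq_in_Sample taken from the "N-Gram Dictionary" table.
freq_in_sample <- c(Unigram = 469414, Bigram = 584274,
                    Trigram = 557455, Fourgram = 531413)

# Assumed definition: occurrences as a percentage of all words in the sample.
pct_of_sample <- 100 * freq_in_sample / total_sample

round(sample_share, 1)
round(pct_of_sample, 5)
```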

Plans for Creating a Prediction Algorithm and Shiny App

With the unigram, bigram, trigram and four-gram dictionaries created from the sample, we are ready to build the prediction algorithm. Based on the Markov assumption, which states that “the future is independent of the past given the present”, we rely on the last few words of the input, especially the last word.

In other words, for a bigram model, \(P(\text{the} \mid \text{its water is so transparent that})\) is approximately the same as \(P(\text{the} \mid \text{that})\).

A bigram prediction model is possible, but it may not capture word phrases effectively, as language has long-distance dependencies. We would therefore rely on higher-order n-gram dictionaries as well when building the model. To make dictionary lookups more efficient, we would subsequently keep only the words/phrases in the dictionaries with frequencies of at least four. This will help speed up lookups and save memory, since we are mainly interested in words/phrases with high frequencies.
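
A minimal sketch of the frequency cut-off, assuming each dictionary is held as a data frame with `ngram` and `freq` columns (a layout assumed for this example). The helper name `prune_dictionary` and the low-frequency row are hypothetical; the two larger counts are taken from the “of” example in this report.

```r
# Hypothetical helper: keep only entries seen at least min_freq times.
prune_dictionary <- function(dict, min_freq = 4) {
  dict[dict$freq >= min_freq, , drop = FALSE]
}

# Toy dictionary; "of zebras" is a made-up rare entry.
toy_dict <- data.frame(ngram = c("of the", "of a", "of zebras"),
                       freq  = c(2427L, 444L, 2L),
                       stringsAsFactors = FALSE)

pruned <- prune_dictionary(toy_dict)
```

The trade-off is that rare but valid continuations can no longer be predicted, so the threshold is worth revisiting against the validation sample.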

An outline of the algorithm is as follows:

Read the user’s input and apply the same text transformations that were applied to the sample above (which forms our training data).
Depending on the number of words entered by the user, the app will take the last one to three words of the input. We will then check these words against the unigram, bigram, trigram and four-gram dictionaries where applicable. For instance, if the input is “of”, we will check the bigram dictionary for entries starting with “of” and note the top 5 next words and their corresponding frequencies. Please see the example below.
The model will predict five possible “next” words based on the maximum likelihood estimate. For bigram probabilities, the maximum likelihood estimate is \(P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}\). For an input of only one word, we would rank candidates by raw frequency: all candidates share the same denominator \(\mathrm{count}(w_{i-1})\), so the maximum likelihood estimate gives the same ordering and is not needed for the comparison.
The model will be tested on the validation sample before the Shiny App is built for use.

input<-"of"

##      NextWord FrequencyNextWord
## [1,] "the"    "2427"           
## [2,] "a"      "444"            
## [3,] "my"     "277"            
## [4,] "his"    "187"            
## [5,] "our"    "131"
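
The lookup behind this example can be sketched as follows. This is an illustrative base R version under stated assumptions, not the code that produced the output above: the function name `predict_next`, the data frame layout, the “in the” row and the unigram counts are hypothetical, while the five “of” frequencies are copied from the example.

```r
# Toy bigram dictionary. The five "of ..." counts come from the example above;
# the "in the" row is a made-up entry to show that other prefixes are ignored.
bigram_dict <- data.frame(
  first     = c("of", "of", "of", "of", "of", "in"),
  next_word = c("the", "a", "my", "his", "our", "the"),
  freq      = c(2427, 444, 277, 187, 131, 50),
  stringsAsFactors = FALSE
)

# Hypothetical unigram counts, used only as the MLE denominator.
unigram_count <- c(of = 10000, "in" = 5000)

predict_next <- function(input, bigram_dict, unigram_count, k = 5) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  last  <- words[length(words)]                        # bigram model: last word only
  hits  <- bigram_dict[bigram_dict$first == last, , drop = FALSE]
  hits  <- hits[order(-hits$freq), , drop = FALSE]     # most frequent first
  hits  <- head(hits, k)                               # keep the top k candidates
  # MLE: count(w_{i-1}, w_i) / count(w_{i-1}); NA if the word is not in unigram_count
  hits$mle <- hits$freq / unname(unigram_count[last])
  hits[, c("next_word", "freq", "mle")]
}

predict_next("a cup of", bigram_dict, unigram_count)
```

Because every candidate is divided by the same unigram count, sorting by `freq` and sorting by `mle` give the same ranking, which is why raw frequencies suffice for a one-word lookup.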