Introduction

This report begins the Data Science Specialisation Capstone Project, whose goal is to build a predictive text model. We are given three English text files (blogs, Twitter and news) from a corpus called HC Corpora, which together form the training dataset for the model. At this stage, we aim to understand the text data well enough to build an n-gram dictionary for the prediction model.

Exploratory (Text) Data Analysis

First, we seek to understand the distribution of, and relationships between, the words, tokens and phrases in the three text files, in preparation for building our first linguistic models. This includes the frequencies of words and word pairs/phrases in the text files, and how those frequencies vary.

Text Size
TextSource Object_Size_in_Bytes Line_Count Word_Count
Filetwit 316037344 2360148 30373792
Fileblog 260564320 899288 37334441
Filenews 20111392 77259 2643972

Creating Subsets of Twitter, Blog and News Text through Random Sampling

Next, given the large text size, we draw random samples from the three text files, with about 200,000 words in each sample, for exploratory analysis. Altogether, these account for about 0.9% of the original text files by word count, which should provide sufficient text to build an n-gram dictionary. The random samples are then combined and loaded as a corpus for subsequent text preprocessing/cleaning.

Sample Text Size
TextSource Object_Size_in_Bytes Line_Count Word_Count
twitsample 2170432 16000 206072
blogsample 1448800 5000 208047
newsample 1570176 6000 206915
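
The random sampling described above can be sketched in base R as follows. This is only an illustration: the helper name `sample_lines`, the seed and the toy input are assumptions, not the code used for this report. The line counts in the table (16,000, 5,000 and 6,000) suggest that sampling was done by line, which is what the sketch does.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                        # fixed seed so the sample can be regenerated
  n <- min(n, length(lines))            # never ask for more lines than exist
  lines[sample.int(length(lines), n)]   # sample without replacement
}

# Toy stand-in for a text file that has already been read with readLines()
toy_lines  <- paste("line", 1:100)
toy_sample <- sample_lines(toy_lines, 10)
length(toy_sample)
```

Fixing the seed means the same sample can be rebuilt later, for example when a separate validation sample has to be kept apart from the training sample.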

Pre-Processing the Random Samples of Twitter, Blog and News

Upon loading the sample data as a corpus, we start “cleaning” the text. The following text transformations are performed using the tm_map() function:

  1. Replace contractions with full words for better predictive power and to denote the n-grams correctly. For instance, for the input text “I”, the next word could be “will”, “am”, etc. if the full words are stored; otherwise the model may predict “ll” or “m” instead.
  2. Remove URLs and replace “/”, “@” and “|” in the text with a space.
  3. Remove special characters such as â and ê that may be found in foreign-language text. The exception is a small set of special characters that are better replaced by an apostrophe, because the affected words are mainly contractions; this replacement is done before step 1 so that those words are expanded to full words if they are assessed to be contractions.
  4. Convert all text to lower case for ease of analysis.
  5. Remove numbers, punctuation (while preserving intra-word dashes) and extra white space.
  6. Remove hashtags and Twitter handles.
  7. Remove profanity (Note: the list of bad words is from Luis von Ahn’s research group at CMU, see http://www.cs.cmu.edu/~biglou/resources/).

Common English stopwords are, however, not removed, since stopwords are themselves plausible next words and are therefore useful in predictive text modelling. Text stemming is also not performed, as we want to capture all forms of words rather than reduce words to their root form.
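
To make the cleaning steps concrete, here is a minimal sketch in base R. It is not the tm_map() pipeline used for this report: the function name `clean_text`, the short contraction table and the order of the steps are assumptions chosen for illustration (for example, handles and hashtags are removed before “@” would be stripped as punctuation).

```r
# Illustrative cleaning function in base R (not the report's tm_map() pipeline).
# The contraction table and the step order are assumptions made for this sketch.
clean_text <- function(x, profanity = character(0)) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)                   # curly apostrophe -> plain apostrophe
  x <- tolower(x)                                             # lower case
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)    # URLs
  x <- gsub("[@#]\\w+", " ", x, perl = TRUE)                  # Twitter handles and hashtags
  # Expand a few contractions; specific forms must come before the generic "n't"
  contractions <- c("won't" = "will not", "can't" = "cannot", "i'm" = "i am",
                    "n't" = " not", "'ll" = " will", "'re" = " are", "'ve" = " have")
  for (k in names(contractions)) {
    x <- gsub(k, contractions[[k]], x, fixed = TRUE)
  }
  x <- gsub("[^a-z -]", " ", x, perl = TRUE)                  # digits, punctuation, accented letters
  x <- gsub("(?<![a-z])-+|-+(?![a-z])", " ", x, perl = TRUE)  # keep only intra-word dashes
  if (length(profanity) > 0) {
    # Assumes the profanity entries are plain lower-case words without regex metacharacters
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  trimws(gsub("\\s+", " ", x, perl = TRUE))                   # collapse extra white space
}

cleaned <- clean_text("I'm sure @bob won't read http://example.com/x #fun, a well-known draft - in 2014!")
cleaned
```

Stopwords are deliberately left in and no stemming is applied, in line with the choices described above.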

Building an N-gram Dictionary

Subsequently, the text is converted into a term-document matrix for further computation. In a term-document matrix, terms form the rows and document IDs form the columns (its transpose, the document-term matrix, has document IDs as rows and terms as columns). The matrix elements are term frequencies. The frequencies of unigrams (n-grams of size 1), bigrams (size 2), trigrams (size 3) and four-grams (size 4) are displayed in the barplots and word clouds below.
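
As a rough illustration of what the n-gram frequencies represent, the sketch below counts n-grams directly from cleaned lines in base R. It does not build a term-document matrix and is not the code used for this report; `count_ngrams` and the toy sentences are hypothetical.

```r
# Hypothetical helper: count n-grams in a character vector of cleaned lines.
count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]                  # drop empty tokens
    if (length(words) < n) return(character(0))    # line too short for this n
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  sort(table(grams), decreasing = TRUE)            # most frequent n-grams first
}

toy <- c("i do not know", "i do not care", "do not go")
unigrams <- count_ngrams(toy, 1)
bigrams  <- count_ngrams(toy, 2)
head(bigrams, 3)
```

N-grams are built within each line only, so no n-gram spans the boundary between two unrelated lines.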

Interesting Findings

Some interesting findings from this stage of the project are as follows:

  1. Converting special characters appropriately matters because it affects the word frequencies on which the word prediction model depends. Comparing the count of the word “not” in my sample before and after the conversion, the difference is more than 700.
  2. Replacing contractions with full words matters because it affects not only the word frequencies but also which dictionary the words fall into. For instance, “don’t” is counted in the unigram dictionary if it is not converted, but if it is converted it is counted in the bigram dictionary as “do not”, under “do”. This aids prediction when the user enters “do”, “not” or “do not”, as well as for the words before and after each of these words/phrases.
  3. It is important to know the type of text object being handled (e.g. character, VCorpus or Corpus), as certain functions only work with particular types.
  4. The slope of the frequency distribution is steepest for unigrams (i.e. the frequency of unigram words drops rapidly) and becomes more gradual as the order of the n-gram increases. This is consistent with the wide frequency range of 28,551 to 1,850 for the top 20 unigram words, compared with the range of 70 to 27 for the top 20 phrases in the four-gram dictionary.
  5. From the summary table below, it is interesting to note that although the number of unique words/phrases is not large, the occurrences of these unigrams, bigrams, trigrams and four-grams are high and represent more than 75% of the words in the random sample.

N-Gram Dictionary
Ngram_Dictionary Unique_Words_Phrases Freq_in_Sample Percentage_of_Sample
Unigram 43588 469414 75.58588
Bigram 289955 584274 94.08084
Trigram 477208 557455 89.76240
Fourgram 516606 531413 85.56907
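
The Percentage_of_Sample column appears to be Freq_in_Sample divided by the combined word count of the three samples (206,072 + 208,047 + 206,915 = 621,034). That reading is an inference from the numbers rather than something stated in the report; the short check below shows that it matches the tabulated values.

```r
# Word counts of the three samples, taken from the "Sample Text Size" table
sample_words <- c(twitsample = 206072, blogsample = 208047, newsample = 206915)
total_words  <- sum(sample_words)

# Freq_in_Sample values, taken from the "N-Gram Dictionary" table
freq <- c(Unigram = 469414, Bigram = 584274, Trigram = 557455, Fourgram = 531413)

# Assumed definition: share of the total sample word count, in percent
pct <- 100 * freq / total_words
round(pct, 5)
```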

Plans for Creating a Prediction Algorithm and Shiny App

With the unigram, bigram, trigram and four-gram dictionaries created from the sample, we are ready to build the prediction algorithm. Based on the Markov assumption, which states that “the future is independent of the past given the present”, we rely on the last few words of the input, especially the last word.

In other words, for a bigram model, \(P(\text{the} \mid \text{its water is so transparent that})\) is approximately the same as \(P(\text{the} \mid \text{that})\).

A bigram prediction model is possible, but it may not capture phrases effectively, as language has long-distance dependencies. We would therefore also rely on the higher-order n-gram dictionaries when building the model. To make dictionary lookups more efficient, we would subsequently keep only the words/phrases with frequencies of at least four. This will speed up lookups and save memory, since we are mainly interested in words/phrases with high frequencies.
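
The frequency cut-off could look like the sketch below, assuming each dictionary is held as a named vector of counts. That storage format, the helper name `prune_ngrams` and the toy counts are assumptions for illustration only.

```r
# Hypothetical helper: keep only n-grams seen at least `min_freq` times.
prune_ngrams <- function(counts, min_freq = 4) {
  counts[counts >= min_freq]
}

# Toy bigram counts (made up for this example)
toy_counts <- c("of the" = 12, "in the" = 7, "of my" = 4, "of zebra" = 1)
pruned <- prune_ngrams(toy_counts)
names(pruned)
```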

The outline of the algorithm is as follows:

  1. Read the user’s input and apply the same text transformations that were applied to the sample (which forms my training data) above.
  2. Depending on the number of words entered by the user, the app will extract the last one to three words of the input. We will then check these words against the unigram, bigram, trigram and four-gram dictionaries where applicable. For instance, if the input is “of”, we will check the bigram dictionary for entries starting with “of” and note the top 5 next words and their corresponding frequencies. Please see the example below.
  3. The model will predict five possible “next” words based on the maximum likelihood estimate. For bigram probabilities, the maximum likelihood estimate is \(P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}\). For an input of only one word, we would rank candidates by raw frequency, since the denominator \(\mathrm{count}(w_{i-1})\) is the same for every candidate and so does not change the ordering.
  4. The model will be tested on the validation sample before the Shiny App is built for use.
input<-"of"
##      NextWord FrequencyNextWord
## [1,] "the"    "2427"           
## [2,] "a"      "444"            
## [3,] "my"     "277"            
## [4,] "his"    "187"            
## [5,] "our"    "131"
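
The lookup outlined above can be sketched as follows. This is a minimal illustration under several assumptions: the dictionaries are stored as named count vectors, the toy counts are made up, and the simple "longest context first, then shorter" back-off is one possible reading of "where applicable" rather than the final algorithm. The names `predict_next` and `dicts` are hypothetical.

```r
# Toy dictionaries (made-up counts): named vectors keyed by the n-gram string.
unigram <- c("of" = 10, "the" = 8, "a" = 3, "end" = 2)
bigram  <- c("of the" = 6, "of a" = 3, "of my" = 1, "the end" = 2, "end of" = 2)
trigram <- c("end of the" = 2)
dicts   <- list(unigram, bigram, trigram)   # dicts[[n]] holds the n-gram counts

predict_next <- function(input, dicts, top = 5) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  max_k <- min(length(words), length(dicts) - 1)     # longest usable context
  for (k in rev(seq_len(max_k))) {                   # try longer contexts first
    context <- paste(tail(words, k), collapse = " ")
    prefix  <- paste0(context, " ")
    counts  <- dicts[[k + 1]]                        # (k+1)-gram dictionary
    hits    <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      hits  <- head(sort(hits, decreasing = TRUE), top)
      # MLE denominator: count of the context in the k-gram dictionary, if present
      denom <- if (context %in% names(dicts[[k]])) dicts[[k]][[context]] else NA_real_
      return(data.frame(NextWord    = sub(prefix, "", names(hits), fixed = TRUE),
                        Frequency   = as.vector(hits),
                        Probability = as.vector(hits) / denom,
                        stringsAsFactors = FALSE))
    }
  }
  # No context matched (or empty input): return an empty result
  data.frame(NextWord = character(0), Frequency = numeric(0),
             Probability = numeric(0), stringsAsFactors = FALSE)
}

res <- predict_next("of", dicts)
res
```

Because every candidate for a given context shares the same denominator, sorting by Frequency and sorting by Probability give the same ranking; the probability only matters when candidates from different contexts are compared.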