Executive Summary

The goal of this project is to develop a model Shiny Web Application for a predictive text generator, which predicts the next word a user intends to type based on word frequency and context.

The natural language processing techniques for this project were executed on a computer with a Core i5 processor @ 3.10 GHz and 8.00 GB RAM.

This intermediate report describes the first step of this project: understanding the distribution of, and relationships between, the words and tokens in the text. It meets the following benchmarks:

  1. Demonstrates the approach used to download and clean data

  2. Creates a basic report of summary statistics about the data sets

  3. Reports on some interesting findings

  4. Gives feedback on the plan for creating a prediction algorithm and Shiny app

Data Description

The data was provided from content archived at heliohost.org. It can be retrieved from this {Online Data Source}.

Further information on the data can be found {here}. The data is provided in several languages. For this project, only the corpora (blog posts, news articles and tweets) in the en_US locale (US English) are considered.

1.0 Preparing the Environment

rm(list=ls())
library(stringi);   library(knitr);
library(quanteda);  library(ggplot2);
library(wordcloud); library(RColorBrewer);
library(doParallel);library(parallel)

set.seed(123)
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)

2.0 Loading the Data

It is assumed that the data has been unzipped and saved locally and a working directory has been established.

The data is read in binary format to preserve all characters and to ensure smoother analysis. {Reference}

R is fairly slow at reading files. The options for reading are listed from ‘slowest’ to ‘fastest’: read.table(), scan(), readLines(). {Reference}

con <- file("en_US.blogs.txt", open = "rb")
blogs <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)

con <- file("en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)

con <- file("en_US.twitter.txt", open = "rb")
twitter <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)

2.1 Basic Data Summary

A basic summary is developed for each of the three datasets.

File Name  File Size (MB)  Word Count  Line Count  Max Char Count per Line  Min Char Count per Line
Blogs              200.42    37546246      899288                    40833                        1
News               196.28    34762395     1010242                    11384                        1
Twitter            159.36    30093410     2360148                      140                        2

The file sizes shown here do not account for the associated metadata. {Reference}

It can be observed that:
* 556 MB of space is required to load all three files. Hence, sampling of the data is recommended for quicker analysis.
* The Twitter data file has the most lines but the fewest words, which is expected given the character limit enforced on that medium.
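
The code that produced the summary table is not shown in this report. The following is a minimal sketch of how such figures could be computed, written in base R so that it runs without extra packages. The helper name summarise_lines and the toy input are assumptions made for illustration, and counting words by splitting on whitespace may give slightly different totals from the ones reported above.

```r
# Hypothetical helper (not the report's original code): summarise a
# character vector holding one line of text per element.
summarise_lines <- function(lines) {
  chars <- nchar(lines, type = "chars")
  # Approximate the word count by splitting each line on whitespace
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  words[!nzchar(trimws(lines))] <- 0L   # blank lines contain no words
  data.frame(
    size_mb    = as.numeric(object.size(lines)) / 1024^2,  # in-memory size
    word_count = sum(words),
    line_count = length(lines),
    max_chars  = max(chars),
    min_chars  = min(chars)
  )
}

# Toy input standing in for blogs, news or twitter
toy <- c("We love you Mr. Brown.", "He wasn't home alone, apparently.", "")
toy_summary <- summarise_lines(toy)
toy_summary
```

Applied to the objects loaded earlier, the call would be, for example, summarise_lines(blogs). Note that object.size() reports the in-memory size of the R object, which need not match the size of the file on disk.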

2.2 Text Samples from the Corpora

A sample of raw text from the corpora is shown below. It can be observed that the text requires further processing to remove unrecognized characters (unsupported languages, emojis, etc.). {Reference}

  • Blogs
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."
  • News
## [1] "He wasn't home alone, apparently."                                                                                                                        
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
  • Twitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."

2.3 Data Sampling

Sampling is done by drawing 5% of the lines of each file at random, without replacement. Simple text cleaning is done to remove non-ASCII characters and other special characters, such as emoticons. Each sample is then saved separately and combined into a ‘master’ sample. {Reference}

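set.seed(123)  # make the random sample reproducible
# Draw 5% of the lines of each file, without replacement
sblogs <- blogs[sample(seq_along(blogs), floor(0.05 * length(blogs)), replace = FALSE)]
snews <- news[sample(seq_along(news), floor(0.05 * length(news)), replace = FALSE)]
stwitter <- twitter[sample(seq_along(twitter), floor(0.05 * length(twitter)), replace = FALSE)]
# Combine the three samples into one 'master' sample
samplecombined <- c(sblogs, snews, stwitter)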
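
The text-cleaning step mentioned above is not shown in this report. One common way to drop non-ASCII characters in base R is iconv() with an empty substitution string. The sketch below assumes that approach; the helper name remove_non_ascii and the toy strings are hypothetical.

```r
# Hypothetical helper (an assumption, not the report's original code):
# remove every character that cannot be represented in ASCII.
remove_non_ascii <- function(lines) {
  # sub = "" replaces each non-convertible byte with nothing
  iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
}

toy <- c("caf\u00e9 au lait", "plain text", "smile \U0001F600!")
cleaned <- remove_non_ascii(toy)
cleaned
```

A line that consists only of non-ASCII characters becomes an empty string after this step, so the cleaned sample may contain lines of length zero.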

A basic summary is developed for each of the three sampled datasets.

File Name  File Size (MB)  Word Count  Line Count  Max Char Count per Line  Min Char Count per Line
Blogs               19.93     1882049       44964                     5384                        0
News                19.62     1742558       50512                     2071                        2
Twitter             15.97     1504873      118007                      140                        3

3.0 Data Tokenization and Visualization

A corpus of the samples is created. A list of 378 English terms that could be perceived as offensive is used as a general basis for profanity filtering. The online list (from Shutterstock) can be found {here}.

The Quanteda package is used for this analysis for its computational efficiency and simplicity compared to the conventional tm package. Both packages have been tested, and the former is preferred. Further reading on the advantages of Quanteda can be found {here}.

Tokenization is then implemented as follows: {Reference 1} {Reference 2}

Note that stopwords (the most common words of the English language) are not removed and words are not stemmed, to preserve the accuracy of the N-grams. Stopwords will be removed at a later stage, as recommended by the creators of Quanteda. {Reference}

The 25 most frequent UniGrams (1 token) and their numbers of occurrences are listed here.

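# Build the unigram document-feature matrix from the sampled corpus.
# The argument names below are those of the quanteda release used for this
# report; more recent releases form n-grams with tokens_ngrams() instead.
dfm_uni <- dfm(allcorpus, ngrams = 1, verbose = TRUE, tolower = TRUE,
            remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE,
            remove_twitter = TRUE, remove_url = TRUE, remove_symbols = TRUE,
            ignoreFeatures = profanity,   # drop terms on the profanity list
            language = "english", thesaurus = NULL, dictionary = NULL,
            valuetype = c("glob", "regex", "fixed"), simplify = TRUE)
save(dfm_uni, file = "unigram.RData")
top_uni <- topfeatures(dfm_uni, 25)  # top 25 words
top_uni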
##    the     to    and      a     of      i     in    for     is   that 
## 239811 137956 120869 118966 100470  82988  82720  55303  53719  52294 
##    you     it     on   with    was     my     at   this     be   have 
##  47284  46358  40851  35522  31335  30198  28808  27277  27245  26394 
##    are    but     as     he     we 
##  24653  24342  24152  21458  21009

Using the same function, the corpus is tokenized into BiGrams (2 tokens), TriGrams (3 tokens) and QuadriGrams (4 tokens). The 25 most frequent BiGrams, TriGrams and QuadriGrams are listed here, in that order.

##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    21709    20748    10788    10242     9727     7988     7219     6464 
##     in_a with_the     is_a   it_was    for_a from_the    i_was   i_have 
##     5949     5273     4948     4778     4692     4442     4400     4307 
##    and_i    it_is   with_a  will_be going_to     of_a   if_you     i_am 
##     4240     4212     4127     4043     4039     4027     3779     3722 
##   have_a 
##     3715
##         one_of_the           a_lot_of     thanks_for_the 
##               1733               1523               1199 
##            to_be_a        going_to_be          i_want_to 
##                892                869                766 
##         the_end_of           it_was_a         out_of_the 
##                759                747                713 
##        some_of_the         as_well_as         be_able_to 
##                690                688                687 
##        part_of_the           i_have_a looking_forward_to 
##                632                593                585 
##        the_rest_of      thank_you_for          i_have_to 
##                581                560                527 
##        a_couple_of          this_is_a          i_need_to 
##                503                499                490 
##     the_first_time        is_going_to         i_love_you 
##                487                485                478 
##         end_of_the 
##                473
##        the_end_of_the       the_rest_of_the         at_the_end_of 
##                   388                   364                   319 
## thanks_for_the_follow    for_the_first_time      at_the_same_time 
##                   309                   278                   261 
##       one_of_the_most         to_be_able_to        is_going_to_be 
##                   232                   209                   207 
##      when_it_comes_to      in_the_middle_of         is_one_of_the 
##                   201                   195                   190 
##         going_to_be_a     thanks_for_the_rt        if_you_want_to 
##                   182                   178                   160 
##     thank_you_for_the       one_of_the_best     can't_wait_to_see 
##                   159                   157                   142 
##  in_the_united_states       i_don't_want_to     thank_you_so_much 
##                   130                   129                   126 
##         by_the_end_of     the_middle_of_the        the_top_of_the 
##                   125                   125                   118 
##         i_am_going_to 
##                   116
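
To make the construction of the n-grams listed above concrete, the following base R sketch forms n-grams by joining consecutive tokens with "_" and counts them. It is an illustration only and not the Quanteda code used for this report; the helper names simple_tokens and make_ngrams, the simplified cleaning rule and the toy sentence are all assumptions.

```r
# Hypothetical helpers, written in base R for illustration only.

# Lower-case the text and keep letters, apostrophes and spaces
simple_tokens <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens[nzchar(tokens)]
}

# Join every run of n consecutive tokens with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens  <- simple_tokens("Thanks for the follow and thanks for the RT")
bigrams <- make_ngrams(tokens, 2)

# Frequency table of the bigrams, most frequent first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 3)
```

The same helper gives trigrams and quadrigrams with n = 3 and n = 4. On the full sample a dedicated package is far faster than this loop-based sketch.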

3.1 Visualization of N-Gram

Histograms visualize the frequency distribution of the various n-grams, while WordClouds highlight the most prominent words and n-grams in the content.

  • Histogram and WordCloud of UniGram for All Corpora

  • Histogram and WordCloud of BiGram for All Corpora

  • Histogram and WordCloud of TriGram for All Corpora

  • Histogram and WordCloud of QuadriGram for All Corpora

The frequency of bigrams is approximately one-tenth that of unigrams. The frequency of trigrams is approximately one-tenth that of bigrams and one-hundredth that of unigrams. For Quadrigrams, the frequencies are smaller still.

3.2 Unique Word Analysis

Unique word analysis provides insight into how many words are needed to cover a given part of the corpus. It is essentially a Cumulative Distribution Function (CDF) analysis. A 50% coverage level refers to the number of unique words that together account for 50% of all word occurrences in the corpus.
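
The coverage calculation described above can be sketched in a few lines of base R. The helper name terms_for_coverage and the toy frequencies are assumptions for illustration; in this report the input would be the term frequencies of each n-gram matrix.

```r
# Hypothetical helper: smallest number of distinct terms whose combined
# frequency reaches the requested share of all term occurrences.
terms_for_coverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)           # most frequent first
  cumulative_share <- cumsum(freq) / sum(freq)    # empirical CDF
  unname(which(cumulative_share >= coverage)[1])
}

# Toy frequencies: 100 occurrences spread over 5 distinct terms
toy_freq <- c(the = 50, to = 20, and = 15, cat = 10, zebra = 5)
n50 <- terms_for_coverage(toy_freq, 0.50)
n90 <- terms_for_coverage(toy_freq, 0.90)
c(n50 = n50, n90 = n90)
```

In the toy data a single term already covers half of all occurrences, while four of the five terms are needed to reach 90%, which mirrors the long-tail pattern reported for the real corpora.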

  • Percentage of Unique UniGrams at 50% (green) and 90% (red) Coverage Levels

143 unique UniGrams account for 50% of the corpora, and 7958 unique UniGrams are needed for 90%.

  • Percentage of Unique BiGrams at 50% (green) and 90% (red) Coverage Levels

39078 unique BiGrams account for 50% and 1097146 unique BiGrams account for 90% of the corpora.

  • Percentage of Unique TriGrams at 50% (green) and 90% (red) Coverage Levels

1075952 unique TriGrams account for 50% and 2908655 unique Trigrams account for 90% of the corpora.

  • Percentage of Unique QuadriGrams at 50% (green) and 90% (red) Coverage Levels

1075952 unique QuadriGrams account for 50% and 2908655 unique Quadrigrams account for 90% of the corpora.

The UniGram distribution follows a conventional CDF curve. The BiGram curve rises steeply at first and then changes sharply to a linear segment. The TriGram and QuadriGram curves behave similarly.

For a good representation of the sampled corpora (90% coverage), about 8000 UniGrams are needed, but millions of BiGrams, TriGrams and QuadriGrams are necessary.

4.0 Feedback and Next Steps

  1. A relatively small number of words accounts for the majority of the corpora. Whether to keep ‘uncommon’ words should be reconsidered in the subsequent modelling work, since dropping them would reduce the memory requirements of the final application.

  2. While not shown here, when stopwords are removed, the significant words become very different.

  3. This initial n-gram model may be useful for further implementation of backoff models. Backoff models start with checking an n-gram to predict the outcome. If that fails to give a conclusive answer, it moves to the (n-1)-gram and so on.

  4. The next step is to create a model and integrate it into a Shiny Web Application for a predictive text generator.

  5. For a lightweight application, the modelling process should be optimized for performance and storage (e.g. efficient access to stored information). The trade-off with accuracy needs to be considered carefully.

  6. The sampling of the data will be investigated further (e.g. splitting into training, testing and validation sets, and increasing or decreasing the sample size).

  7. Low frequency n-grams will be removed.

  8. The context of the corpora will be taken into consideration.

  9. The words in n-grams that span sentences within a line may not be truly related to one another. Special end-of-sentence markers will be inserted at sentence-ending punctuation (“.”, “?”, “!”, and perhaps “;”).

  10. The best possible backoff model for estimating the conditional probability of a word given its history in the n-gram will be investigated.
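
The backoff idea described in item 3 can be sketched as follows. This is a deliberately simple illustration in base R, without the discounting or probability estimates a full backoff model would use. The lookup tables, their contents and the function name predict_next are all hypothetical.

```r
# Hypothetical lookup tables: names are the preceding words joined by "_",
# values are the most frequent next word seen after that context.
trigram_next <- c("thanks_for" = "the", "one_of" = "the")
bigram_next  <- c("for" = "the", "going" = "to")
unigram_top  <- "the"   # fallback: the most frequent single word

# Try the longest available context first, then back off to shorter ones
predict_next <- function(words) {
  words <- tolower(words)
  n <- length(words)
  if (n >= 2) {
    key <- paste(words[(n - 1):n], collapse = "_")
    if (key %in% names(trigram_next)) return(unname(trigram_next[key]))
  }
  if (n >= 1) {
    key <- words[n]
    if (key %in% names(bigram_next)) return(unname(bigram_next[key]))
  }
  unigram_top
}

p1 <- predict_next(c("thanks", "for"))    # matched at the trigram level
p2 <- predict_next(c("am", "going"))      # backs off to the bigram level
p3 <- predict_next(c("purple", "zebra"))  # backs off to the unigram fallback
c(p1, p2, p3)
```

Storing only the top continuation for each context keeps the tables small, which suits a lightweight application, at the cost of offering a single suggestion per context.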