The goal of this project is to develop a model Shiny Web Application for a predictive text generator, which predicts the next word a user intends to type based on word frequency and context.
The natural language processing techniques for this project were executed on a computer with a Core i5 processor @ 3.10 GHz and 8.00 GB of RAM.
This intermediate report describes the first step of the project: understanding the distribution of, and relationships between, the words and tokens in the text. It meets the following benchmarks:
* Demonstrates the approach used to download and clean data
* Creates a basic report of summary statistics about the data sets
* Reports on some interesting findings
* Gives feedback on the plan for creating a prediction algorithm and Shiny app
The data was provided from a content archive hosted at heliohost.org. It can be retrieved from this {Online Data Source}.
Further information on the data can be found {here}. The data is provided in several languages. For this project, only the corpora (blog posts, news articles and tweets) in the en_US locale (US English) are considered.
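rm(list = ls())  # start from a clean workspace
library(stringi); library(knitr)
library(quanteda); library(ggplot2)
library(wordcloud); library(RColorBrewer)
library(doParallel); library(parallel)
set.seed(123)  # make the random sampling reproducible
# Register a parallel backend that uses all but one core
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)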
It is assumed that the data has been unzipped and saved locally and a working directory has been established.
The data is read in binary format to preserve all characters and to ensure smoother analysis. {Reference}
R is fairly slow at reading files. The options for reading, listed from slowest to fastest, are read.table(), scan() and readLines(). {Reference}
con <- file("en_US.blogs.txt", open = "rb")
blogs <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)
con <- file("en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)
con <- file("en_US.twitter.txt", open = "rb")
twitter <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
close(con)
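The three near-identical read blocks above could be folded into a single helper. The following is only a minimal sketch: the function name `read_corpus_file` is hypothetical and not part of the original analysis, and it assumes the same `readLines()` arguments as used above.

```r
# Hypothetical helper (not from the original report): read a text file
# through a binary-mode connection and always close the connection,
# even if readLines() fails.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
}

# Usage, assuming the files sit in the working directory:
# blogs   <- read_corpus_file("en_US.blogs.txt")
# news    <- read_corpus_file("en_US.news.txt")
# twitter <- read_corpus_file("en_US.twitter.txt")
```

Using `on.exit()` keeps the connection from leaking if a read is interrupted, which matters when the same pattern is repeated for several large files.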
A basic summary is developed for each of the three datasets.
| File Name | File Size (MB) | Word Count | Line Count | Max Char Count per Line | Min Char Count per Line |
|---|---|---|---|---|---|
| Blogs | 200.42 | 37546246 | 899288 | 40833 | 1 |
| News | 196.28 | 34762395 | 1010242 | 11384 | 1 |
| Twitter | 159.36 | 30093410 | 2360148 | 140 | 2 |
The file size shown here does not account for the associated metadata. {Reference}
It can be observed that:
* 556 MB of space is required to load all three files. Hence, sampling the data is recommended for quicker analysis.
* The Twitter data file has the most lines but fewest words, which is expected given the character limit enforced on that medium.
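The summary statistics in the table above could be computed along the following lines. This is a sketch under stated assumptions rather than the code behind the report: the helper name `summarise_file` is hypothetical, words are approximated as runs of non-whitespace characters (a tokenizer such as `stringi::stri_count_words()` would count slightly differently), and the size is taken from the file on disk.

```r
# Hypothetical helper (not from the original report): one summary row per file.
# Assumption: a "word" is any run of non-whitespace characters.
summarise_file <- function(name, lines, path) {
  words_per_line <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
  chars_per_line <- nchar(lines)
  data.frame(
    name       = name,
    size_mb    = file.size(path) / 1024^2,  # size on disk, in megabytes
    word_count = sum(words_per_line),
    line_count = length(lines),
    max_chars  = max(chars_per_line),
    min_chars  = min(chars_per_line),
    stringsAsFactors = FALSE
  )
}

# Usage, assuming the objects and files from the reading step:
# rbind(summarise_file("Blogs",   blogs,   "en_US.blogs.txt"),
#       summarise_file("News",    news,    "en_US.news.txt"),
#       summarise_file("Twitter", twitter, "en_US.twitter.txt"))
```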
A sample of raw text from the corpora is shown below. It can be observed that the text requires further processing to remove unrecognized characters (unsupported languages, emojis, etc.). {Reference}
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [2] "We love you Mr. Brown."
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
A random sample of 5% of the lines is drawn from each file, without replacement. Simple text cleaning is done to remove non-ASCII characters and other special characters, such as emoticons. Each sample is then saved separately and combined into a ‘master’ sample. {Reference}
set.seed(123)
# Draw 5% of the lines from each source, without replacement
sblogs <- blogs[sample(seq_along(blogs), 0.05 * length(blogs), replace = FALSE)]
snews <- news[sample(seq_along(news), 0.05 * length(news), replace = FALSE)]
stwitter <- twitter[sample(seq_along(twitter), 0.05 * length(twitter), replace = FALSE)]
# Combine the three samples into the 'master' sample
samplecombined <- c(sblogs, snews, stwitter)
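The cleaning step mentioned above is not shown in the code. One common way to strip non-ASCII characters in base R is `iconv()` with an empty substitute; the sketch below assumes that approach, and the helper name `clean_ascii` is hypothetical.

```r
# Hypothetical helper (not from the original report): drop every byte that
# has no ASCII equivalent, then tidy up the whitespace left behind.
clean_ascii <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  trimws(gsub("\\s+", " ", x))
}

# Usage, assuming the sampled vectors from the previous step:
# sblogs   <- clean_ascii(sblogs)
# snews    <- clean_ascii(snews)
# stwitter <- clean_ascii(stwitter)
```

A line that consists only of non-ASCII characters becomes an empty string after this step, so it may be worth dropping empty lines afterwards.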
A basic summary is developed for each of the three sample datasets.
| File Name | File Size (MB) | Word Count | Line Count | Max Char Count per Line | Min Char Count per Line |
|---|---|---|---|---|---|
| Blogs | 19.93 | 1882049 | 44964 | 5384 | 0 |
| News | 19.62 | 1742558 | 50512 | 2071 | 2 |
| Twitter | 15.97 | 1504873 | 118007 | 140 | 3 |
A corpus of the samples is created. A list of 378 English terms that could be perceived as offensive is used as a general basis for profanity filtering. The online list (from Shutterstock) can be found {here}.
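In principle, profanity filtering amounts to dropping any token that appears in the block list. The sketch below illustrates the idea in base R; the helper name `remove_listed` is hypothetical, and the two entries in `profanity` are harmless placeholders standing in for the downloaded list.

```r
# Hypothetical helper (not from the original report): remove tokens that
# appear in a block list, ignoring case.
remove_listed <- function(tokens, blocklist) {
  tokens[!(tolower(tokens) %in% tolower(blocklist))]
}

# Placeholder entries only; the real list would be read from the downloaded file.
profanity <- c("darn", "heck")
remove_listed(c("What", "the", "Heck", "happened"), profanity)
```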
The quanteda package is used for this analysis because of its computational efficiency and simplicity compared to the conventional tm package. Both packages were tested, and quanteda is preferred. Further reading on the advantages of quanteda can be found {here}.
Tokenization is then implemented as follows: {Reference 1} {Reference 2}
Note that stopwords (the most common words of the English language) are not removed and words are not stemmed, to preserve the accuracy of the N-grams. Stopwords will be removed at a later stage, as recommended by the creators of quanteda. {Reference}
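To make the tokenization and N-gram steps concrete, the following base R sketch shows the underlying idea: lower-case the text, strip numbers and punctuation, split into tokens, then slide a window of n tokens and join them with an underscore. It is an illustration only, since the report relies on quanteda for this step, and the function names `tokenise`, `make_ngrams` and `count_ngrams` are hypothetical.

```r
# Hypothetical helpers (not from the original report).
tokenise <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]+", " ", line)  # drop digits, punctuation and symbols
  tokens <- strsplit(trimws(line), " +")[[1]]
  tokens[nzchar(tokens)]
}

# Slide a window of n tokens over the token vector and join with "_".
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Count n-grams over a vector of lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) make_ngrams(tokenise(l), n)))
  sort(table(grams), decreasing = TRUE)
}

count_ngrams(c("The cat sat.", "The cat ran!"), 2)
```

Because stopwords are kept and no stemming is applied, frequent function-word sequences such as "of_the" remain intact, which is what a next-word predictor needs.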
The top 25 most frequent UniGrams (1 token) and their numbers of occurrences are listed here.
## S3 method for class 'corpus'
dfm_uni <- dfm(allcorpus ,ngrams=1, verbose = TRUE, tolower = TRUE,
remove_numbers = TRUE, remove_punct= TRUE, remove_separators= TRUE,
remove_twitter=TRUE, remove_url=TRUE, remove_symbols=TRUE,
ignoreFeatures=profanity,
language = "english", thesaurus = NULL, dictionary = NULL,
valuetype = c("glob", "regex", "fixed"), simplify= TRUE)
save(dfm_uni,file="unigram.RData")
top_uni <- topfeatures(dfm_uni, 25) #top 25 words
top_uni
## the to and a of i in for is that
## 239811 137956 120869 118966 100470 82988 82720 55303 53719 52294
## you it on with was my at this be have
## 47284 46358 40851 35522 31335 30198 28808 27277 27245 26394
## are but as he we
## 24653 24342 24152 21458 21009
Using the same function, the corpus is tokenized into BiGrams (2 tokens), TriGrams (3 tokens) and QuadriGrams (4 tokens). The top 25 most frequent BiGrams, TriGrams and QuadriGrams are listed here.
## of_the in_the to_the for_the on_the to_be at_the and_the
## 21709 20748 10788 10242 9727 7988 7219 6464
## in_a with_the is_a it_was for_a from_the i_was i_have
## 5949 5273 4948 4778 4692 4442 4400 4307
## and_i it_is with_a will_be going_to of_a if_you i_am
## 4240 4212 4127 4043 4039 4027 3779 3722
## have_a
## 3715
## one_of_the a_lot_of thanks_for_the
## 1733 1523 1199
## to_be_a going_to_be i_want_to
## 892 869 766
## the_end_of it_was_a out_of_the
## 759 747 713
## some_of_the as_well_as be_able_to
## 690 688 687
## part_of_the i_have_a looking_forward_to
## 632 593 585
## the_rest_of thank_you_for i_have_to
## 581 560 527
## a_couple_of this_is_a i_need_to
## 503 499 490
## the_first_time is_going_to i_love_you
## 487 485 478
## end_of_the
## 473
## the_end_of_the the_rest_of_the at_the_end_of
## 388 364 319
## thanks_for_the_follow for_the_first_time at_the_same_time
## 309 278 261
## one_of_the_most to_be_able_to is_going_to_be
## 232 209 207
## when_it_comes_to in_the_middle_of is_one_of_the
## 201 195 190
## going_to_be_a thanks_for_the_rt if_you_want_to
## 182 178 160
## thank_you_for_the one_of_the_best can't_wait_to_see
## 159 157 142
## in_the_united_states i_don't_want_to thank_you_so_much
## 130 129 126
## by_the_end_of the_middle_of_the the_top_of_the
## 125 125 118
## i_am_going_to
## 116
Histograms visualize the frequency distribution of the most common words and n-grams. Similarly, WordClouds give an insight into the most prominent words and n-grams from a content point of view.
Histogram and WordCloud of UniGram for All Corpora
Histogram and WordCloud of BiGram for All Corpora
Histogram and WordCloud of TriGram for All Corpora
Histogram and WordCloud of QuadriGram for All Corpora
The frequency of bigrams is approximately ten times lower than that of unigrams. The frequency of trigrams is approximately ten times lower than that of bigrams and a hundred times lower than that of unigrams. For QuadriGrams, the frequencies are relatively small.
Unique word analysis gives an insight into how many words are needed to cover a given part of the corpus. It is essentially a Cumulative Distribution Function (CDF) analysis. A 50% coverage refers to the number of unique words that account for 50% of all word occurrences in the corpus.
143 unique UniGrams account for 50% and 7958 unique Unigrams are needed for 90% of the corpora.
39078 unique BiGrams account for 50% and 1097146 unique BiGrams account for 90% of the corpora.
1075952 unique TriGrams account for 50% and 2908655 unique Trigrams account for 90% of the corpora.
1075952 unique QuadriGrams account for 50% and 2908655 unique Quadrigrams account for 90% of the corpora.
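The coverage figures above come from a cumulative sum over frequencies sorted in decreasing order. A minimal sketch of that calculation follows; the helper name `terms_for_coverage` is hypothetical, and the toy frequencies are invented for illustration and are not taken from the corpus.

```r
# Hypothetical helper (not from the original report): smallest number of
# distinct terms whose combined frequency reaches `coverage`
# (a fraction between 0 and 1) of all occurrences.
terms_for_coverage <- function(freq, coverage) {
  cum_share <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  unname(which(cum_share >= coverage)[1])
}

# Toy frequencies (illustrative only): cumulative shares are 0.50, 0.75, 0.90, 0.96, 1.00
toy <- c(the = 50, to = 25, and = 15, cat = 6, sat = 4)
terms_for_coverage(toy, 0.5)  # 1
terms_for_coverage(toy, 0.8)  # 3
```

With a real document-feature matrix, `freq` would be the vector of total counts per feature, for example the full output of `topfeatures()` rather than only the top 25.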
The UniGram distribution follows a conventional CDF curve. The BiGram distribution follows a different curve that sharply becomes linear. The TriGram and QuadriGram distributions behave similarly.
For a good representation of the sampled corpora (90% coverage), about 8000 UniGrams are needed, but millions of BiGrams, TriGrams and QuadriGrams are necessary.
A majority of the corpora is dominated by relatively few words. Words that are ‘uncommon’ should be reconsidered in the subsequent modelling work. This would ultimately reduce the memory requirements of the final application.
While not shown here, when stopwords are removed, the significant words become very different.
This initial n-gram model may be useful for further implementation of backoff models. Backoff models start by checking the highest-order n-gram to predict the next word. If that fails to give a conclusive answer, the model moves to the (n-1)-gram, and so on.
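The backoff idea described above can be sketched as a simple lookup that tries the longest available context first. This is a deliberately naive illustration (no discounting or probability smoothing), not the model that will be built; the function name `predict_next`, the table layout and the toy counts are all hypothetical.

```r
# Hypothetical n-gram tables: names are "w1_w2_..." keys, values are counts.
# The counts are toy numbers, not results from the report.
ngram_counts <- list(
  `3` = c(thanks_for_following = 5, thanks_for_the = 3),
  `2` = c(of_the = 20, of_a = 3, for_a = 9),
  `1` = c(the = 100, to = 60)
)

predict_next <- function(history, counts, max_n = 3) {
  history <- tolower(history)
  for (n in rev(seq_len(max_n)[-1])) {   # max_n, ..., 3, 2
    k <- n - 1                           # context words needed at this order
    if (length(history) < k) next
    prefix <- paste0(paste(tail(history, k), collapse = "_"), "_")
    tab <- counts[[as.character(n)]]
    hits <- tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
  # No higher-order match: back off to the most frequent unigram.
  names(which.max(counts[["1"]]))
}

predict_next(c("thanks", "for"), ngram_counts)   # "following" (trigram match)
predict_next(c("waiting", "for"), ngram_counts)  # "a" (backs off to bigram)
predict_next("zebra", ngram_counts)              # "the" (backs off to unigram)
```

A production version would likely replace the linear `startsWith()` scan with an indexed lookup, and weight lower-order matches rather than simply taking the first hit.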
The next step is to create a model and integrate it into a Shiny Web Application for a predictive text generator.
For a lightweight application, the modelling process should be optimized for performance and storage (i.e. efficient access to stored information). Of course, the trade-off with accuracy needs to be considered very carefully.
The sampling of the data will be investigated (i.e. segregation into training, testing and validation sets, and a possible increase or decrease in the sample size).
Low frequency n-grams will be removed.
The context of the corpora will be investigated.
The words in n-grams spanning across sentences within a line may not be truly related to one another. Designated end-of-sentence markers will be inserted wherever sentence-ending punctuation (“.”, “?”, “!”, and perhaps “;”) occurs.
The best possible backoff model for estimating the conditional probability of a word given its history in the n-gram will be investigated.
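One way to keep n-grams from spanning sentence boundaries, as planned above, is to split each line into sentences before generating n-grams. The sketch below is a naive illustration with a hypothetical helper name, `split_sentences`; it also splits at abbreviations such as "St.", so a real implementation would need an exception list or a proper sentence tokenizer.

```r
# Hypothetical helper (not from the original report): split lines into
# sentences at ".", "?", "!" or ";" and drop empty fragments.
split_sentences <- function(lines) {
  parts <- unlist(strsplit(lines, "[.?!;]+\\s*"))
  parts <- trimws(parts)
  parts[nzchar(parts)]
}

split_sentences("How are you? Btw thanks for the RT. Love to see you.")
```

N-grams would then be generated per sentence rather than per line, so that a pair such as "RT" followed by "Love" is never counted as a bigram.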