Synopsis

This paper presents a first exploratory analysis of a corpus made of blog posts, news articles and Twitter messages in English. The original data can be found at corpora.heliohost.org, but Johns Hopkins University and SwiftKey built from it a specific package of texts, the Capstone Dataset, which includes English, Russian, Finnish and German texts. We will concentrate on the English data in this analysis. Due to the sheer size of the corpus, I will select a portion of it, do some cleaning, find the key patterns (frequencies of single words, bigrams and trigrams) and finally propose a strategy for word prediction in the context of, say, typing an SMS or a tweet.

Extraction and Exploratory Analysis of the Data

I will use several libraries:

## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
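
The startup messages above suggest roughly which packages were loaded; a sketch of the likely calls (the exact library() statements are not shown in the report):

library(tm)      # text mining: Corpus, tm_map, DocumentTermMatrix (attaches NLP)
library(RWeka)   # NGramTokenizer and Weka_control, used later for bigrams/trigrams
library(caret)   # createDataPartition (attaches lattice and ggplot2, which masks NLP's annotate)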

Using command-line functions within R (system("wc -w ./en_US/*.txt"), system("wc -l ./en_US/*.txt") and the like) we can easily find out the number of words and lines, respectively, in the documents:

## [1] " 37334690 ./en_US/en_US.blogs.txt"  
## [2] " 34372720 ./en_US/en_US.news.txt"   
## [3] " 30374206 ./en_US/en_US.twitter.txt"
## [4] " 102081616 total"
##        V1                        V2    name
## 1  899288   ./en_US/en_US.blogs.txt   blogs
## 2 1010242    ./en_US/en_US.news.txt    news
## 3 2360148 ./en_US/en_US.twitter.txt twitter
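
A minimal sketch of how these counts can be collected from within R (assuming the en_US folder sits in the working directory; the exact code is not shown in the report):

wordCounts <- system("wc -w ./en_US/*.txt", intern = TRUE)   # word counts per file, plus a "total" line
lineCounts <- system("wc -l ./en_US/*.txt", intern = TRUE)   # line counts per file, plus a "total" line
wordCounts
lineTable <- read.table(text = lineCounts[1:3])              # V1 = line count, V2 = file name
lineTable$name <- c("blogs", "news", "twitter")
lineTable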

So there are more than 30 million words in each file, for a total of 102,081,616 words. The blogs have 899,288 lines, the news 1,010,242 and the twitter file 2,360,148 lines.

Due to the sheer size of the files, I will randomly select 1% of each document's lines. I did try 20%, but computing the Document-Term Matrix then took more than 15 minutes each time. Moreover, even with 2% of the data, the resulting Document-Term Matrix of the bigrams was too big to be handled with the as.matrix conversion for further statistical processing.

For future refinement of the algorithm I will have to find a way to use more data, for example with the slam package, which I have not had time to explore yet.

The random selection is done after setting a seed (for reproducibility) and using the caret package's createDataPartition function. I deliberately chose to merge my three selections into one document, even though one might argue that typing styles and habits differ between a blog post, a tweet and, of course, a news article. As a future improvement, the prediction model could be tuned to the type of document the user is typing, but for now I will not make this distinction.

After randomly selecting and merging, I created a training set by randomly selecting 70% of the dataset and setting aside 30% as test data.
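
A sketch of the sampling and splitting steps, assuming the three files have been read into character vectors blogsLines, newsLines and twitterLines (these names, and the seed value, are placeholders, not taken from the original report):

set.seed(1234)   # placeholder seed; the actual value is not shown
sample1pct <- function(x) x[createDataPartition(seq_along(x), p = 0.01, list = FALSE)]
selected <- c(sample1pct(blogsLines), sample1pct(newsLines), sample1pct(twitterLines))
inTrain  <- createDataPartition(seq_along(selected), p = 0.7, list = FALSE)
training <- selected[inTrain]    # 70% for training
testing  <- selected[-inTrain]   # 30% set aside as test data
length(selected); length(training); length(testing)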

So the merged sample has 42704 lines in total; the training data set has 29896 lines while the test data set has 12808 lines.

## [1] 42704
## [1] 29896
## [1] 12808

Then, from the training data, I created a corpus that I cleaned up with various transformations, in the following order: convert all words to lower case, remove swear words, remove punctuation and numbers as well as "strange" characters (⁰, ½ and ¾), and remove words with tripled characters like aaahh, treeend, etc. Finally I removed redundant white space (which is more a cosmetic step than really useful for our purpose).

Removing bad words might be subject to controversy, since some people intentionally use them and would like the prediction algorithm to predict them. On the other hand, keeping them might lead to unintentional use of bad words that end up in an SMS to a boss or colleague. I used two sources of bad words, bannedwordlist.com and Google, which I combined into one list.

I decided not to remove stopwords and not to stem the corpus, since these two techniques would prevent the algorithm from predicting simple words like "the" or "initially" (which would have been stemmed into "initial").
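
A sketch of such a cleaning pipeline with tm; badWords stands for the combined profanity list described above, and the regular expressions are illustrative rather than the report's exact patterns:

trainingCorp <- VCorpus(VectorSource(training))
trainingCorp <- tm_map(trainingCorp, content_transformer(tolower))
trainingCorp <- tm_map(trainingCorp, removeWords, badWords)                 # combined bad-word list
trainingCorp <- tm_map(trainingCorp, removePunctuation)
trainingCorp <- tm_map(trainingCorp, removeNumbers)
dropPattern  <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))
trainingCorp <- tm_map(trainingCorp, dropPattern, "[⁰½¾]")                  # "strange" characters
trainingCorp <- tm_map(trainingCorp, dropPattern, "\\S*([a-z])\\1\\1\\S*")  # words like aaahh, treeend
trainingCorp <- tm_map(trainingCorp, stripWhitespace)                       # cosmetic clean-up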

Document-Term Matrix analysis

At this point we can create the Document-Term Matrix, a large matrix showing which word is present in which document. In our case a document is a single line of text, so individual words will generally be present in only a few documents (= lines), since those lines are relatively short. Hence the sparsity is 100%. Below we can see the first 6 documents and first 5 terms of that matrix (terms in alphabetical order); in total there are 45652 terms in this cleaned-up corpus.

dtmTrainingCorp <- DocumentTermMatrix(trainingCorp)  # sparsity of 100% !
dtmTrainingCorp$ncol  # 45652 terms
## [1] 45652
inspect(dtmTrainingCorp[1:6,1:5])
## <<DocumentTermMatrix (documents: 6, terms: 5)>>
## Non-/sparse entries: 0/30
## Sparsity           : 100%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs aaba aactive aacu aaliyah aam
##    1    0       0    0       0   0
##    2    0       0    0       0   0
##    3    0       0    0       0   0
##    4    0       0    0       0   0
##    5    0       0    0       0   0
##    6    0       0    0       0   0

Let's have a look at this DocumentTermMatrix after removing the one-letter words (although for true prediction based on n-grams it will be necessary to keep them in the next step of this study), as well as words longer than 20 characters (which are most likely mistypings where a space character was forgotten). In addition, I filtered out words that appear in fewer than 12 documents (= lines, in our case).

dtmTrainingCorpControl2L <- DocumentTermMatrix(trainingCorp,
                                 control=list(wordLengths=c(2, 20),
                                    bounds = list(global = c(12,Inf))))
freqCount2L <- colSums(as.matrix( dtmTrainingCorpControl2L))
sorted2L <- head(sort(freqCount2L, decreasing = TRUE), 40)
sorted2L
##   the    to   and    of    in   for    is  that   you    it    on  with 
## 33680 19594 16990 14004 11603  7588  7538  7198  6455  6327  5783  4988 
##   was    my    at    be  this  have   are    as   but    we    he   not 
##  4369  4307  4034  3975  3922  3649  3533  3465  3396  3007  2977  2847 
##  from    so    me   its   all    by  will  they  said about    or  your 
##  2814  2680  2609  2577  2281  2271  2246  2222  2186  2134  2132  2132 
##    up    an   his  just 
##  2110  2108  2095  2045

By summing up the occurrences of each word and then sorting, we can see which 40 words of at least two letters appear most often in this training corpus.

Let's have a look at the most frequent bigrams of the original training data set (now including the one-letter words again).

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtmTrainingCorpControlBigram <- DocumentTermMatrix(trainingCorp,
                                                   control = list(tokenize = BigramTokenizer))  # 328775
dtmTrainingCorpControlBigram2 <- removeSparseTerms(dtmTrainingCorpControlBigram,
                                                   0.99995)
inspect(dtmTrainingCorpControlBigram2) # 32414 with 0.9999; 61776 with 0.99995
## <<DocumentTermMatrix (documents: 29896, terms: 61776)>>
## Non-/sparse entries: 388732/1846466564
## Sparsity           : 100%
## Maximal term length: 27
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   and the at the for the in a in the of the on the to be to the
##   2073       2      2       1    1      1      5      0     1      0
##   2507       0      2       1    0      5      2      0     0      1
##   2806       1      3       1    0      3     11      3     1      4
##   3751       0      1       3    0      1      2      0     0      1
##   3958       0      0       1    0      3      0      1     0      0
##   4797       1      0       0    0      2      0      0     0      1
##   4895       2      0       0    0      4      2      1     1      0
##   5917       2      0       0    0      2      3      1     0      1
##   5996       0      1       1    1      1      1      0     1      1
##   6034       0      0       0    0      1      1      1     2      1
##       Terms
## Docs   with the
##   2073        2
##   2507        0
##   2806        1
##   3751        1
##   3958        0
##   4797        0
##   4895        0
##   5917        1
##   5996        0
##   6034        0
freqCountBigram2 <- colSums(as.matrix( dtmTrainingCorpControlBigram2))
sortedBigram2 <- head(sort(freqCountBigram2, decreasing = TRUE), 40)

By removing the sparse terms, we reduce the size of the bigram matrix from 328775 terms down to 61776 terms. Here are the 40 most frequent, shown as a diagram:
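
The diagram is built from sortedBigram2; a minimal ggplot2 sketch of how it could be drawn (the helper function below is hypothetical, not code from the original report):

plotTopNgrams <- function(sortedCounts, plotTitle) {
    # keep the n-grams in frequency order on the x-axis
    df <- data.frame(ngram = factor(names(sortedCounts), levels = names(sortedCounts)),
                     count = as.numeric(sortedCounts))
    ggplot(df, aes(x = ngram, y = count)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(title = plotTitle, x = NULL, y = "frequency")
}
plotTopNgrams(sortedBigram2, "40 most frequent bigrams")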

Let’s have a look at the most frequent trigrams after removing the most sparse trigrams:

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtmTrainingCorpTrigram <- DocumentTermMatrix(trainingCorp,
                                                   control = list(tokenize = TrigramTokenizer))
dtmTrainingCorpTrigram2 <- removeSparseTerms(dtmTrainingCorpTrigram, 0.9999665)
inspect(dtmTrainingCorpTrigram2) # 35375 with 0.99995 and 0.9999665; but back to 550074 with 0.9999666
## <<DocumentTermMatrix (documents: 29896, terms: 35375)>>
## Non-/sparse entries: 121085/1057449915
## Sparsity           : 100%
## Maximal term length: 38
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   a lot of going to be i want to it was a looking forward to
##   175         0           0         0        0                  0
##   2388        0           0         1        0                  0
##   2507        0           0         0        0                  0
##   2806        0           0         0        1                  0
##   2978        0           2         0        0                  0
##   3751        0           0         0        0                  0
##   4797        0           0         0        1                  0
##   4895        1           0         0        1                  0
##   5917        0           0         0        0                  0
##   5996        0           0         1        0                  0
##       Terms
## Docs   one of the out of the thanks for the the end of to be a
##   175           0          1              0          0       0
##   2388          1          0              0          0       0
##   2507          0          0              0          0       0
##   2806          0          1              0          0       1
##   2978          0          0              0          1       0
##   3751          0          1              0          0       0
##   4797          0          0              0          0       0
##   4895          0          0              0          0       0
##   5917          1          0              0          0       0
##   5996          0          0              0          1       0
freqCountTrigram2 <- colSums(as.matrix( dtmTrainingCorpTrigram2))
sortedTrigram2 <- head(sort(freqCountTrigram2, decreasing = TRUE), 40)

By removing the sparse trigrams, we reduce the size of the trigram matrix from 550074 trigrams down to 35375 trigrams. Here are the 40 most frequent, shown as a diagram:
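
The trigram diagram can be drawn with the same hypothetical helper sketched above:

plotTopNgrams(sortedTrigram2, "40 most frequent trigrams")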