This paper presents a first exploratory analysis of a corpus made of blog posts, news articles and Twitter messages in English. The original data can be found at corpora.heliohost.org, but Johns Hopkins University and SwiftKey turned it into a dedicated collection of texts, the Capstone Dataset, which includes English, Russian, Finnish and German material. We will concentrate on the English data in this analysis. Due to the sheer size of the corpus, I will select a portion of it, then do some cleaning, find the key patterns (frequencies of words, bigrams and trigrams) and finally propose a strategy for word prediction in the context of, say, typing an SMS or a tweet.
I will use several libraries:
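As a minimal sketch of the setup (the exact package list is an assumption, inferred from the loading messages below): tm provides the corpus infrastructure, RWeka the n-gram tokenizers, and caret the data partitioning.
library(tm)      # also attaches NLP
library(RWeka)   # NGramTokenizer / Weka_control
library(caret)   # loads lattice and ggplot2, which masks annotate from NLP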
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Using command-line functions from within R (system("wc -w ./en_US/*.txt"), system("wc -l ./en_US/*.txt") and the like) we can easily find out the number of words and lines, respectively, in the documents:
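As an illustration, here is a minimal sketch (the exact calls and variable names are assumptions) of how these counts can be captured within R:
wordCounts <- system("wc -w ./en_US/*.txt", intern = TRUE)                            # words per file
lineCounts <- read.table(text = system("wc -l ./en_US/*.txt", intern = TRUE)[1:3])    # drop the 'total' line
lineCounts$name <- c("blogs", "news", "twitter")
wordCounts
lineCounts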
## [1] " 37334690 ./en_US/en_US.blogs.txt"
## [2] " 34372720 ./en_US/en_US.news.txt"
## [3] " 30374206 ./en_US/en_US.twitter.txt"
## [4] " 102081616 total"
## V1 V2 name
## 1 899288 ./en_US/en_US.blogs.txt blogs
## 2 1010242 ./en_US/en_US.news.txt news
## 3 2360148 ./en_US/en_US.twitter.txt twitter
So there are more than 30 million words in each file, for a total of 102,081,616 words. The blogs file has 899,288 lines, the news file 1,010,242 and the Twitter file 2,360,148.
Due to the sheer size of the files, I randomly selected 1% of each document's lines. I did try 20%, but computing the document-term matrix then took more than 15 minutes each time. Moreover, even with 2% of the data, the resulting document-term matrix of bigrams was too big to be handled by the as.matrix conversion for further statistical processing. For future refinements of the algorithm I will have to find a way to use more data, for example with the slam package, which I have not had time to explore yet.
The random selection is done after setting a seed (for reproducibility), using the caret package's createDataPartition function. I deliberately chose to merge my three selections into one document, even though one might argue that typing styles and habits differ between a blog post, a tweet and, of course, a news article. As a future improvement, the prediction model could be tuned to the type of document the user is typing, but for now I will not make this distinction.
After randomly selecting and merging, I created a training set by randomly selecting 70% of the dataset and setting aside 30% as test data.
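A minimal sketch of the sampling and splitting steps (the seed value, file paths and variable names are assumptions, not necessarily those of the actual run):
set.seed(1234)                                    # assumed seed, for reproducibility
blogs   <- readLines("./en_US/en_US.blogs.txt",   skipNul = TRUE)
news    <- readLines("./en_US/en_US.news.txt",    skipNul = TRUE)
twitter <- readLines("./en_US/en_US.twitter.txt", skipNul = TRUE)
sampleLines  <- function(x, p = 0.01) x[createDataPartition(seq_along(x), p = p, list = FALSE)]
merged       <- c(sampleLines(blogs), sampleLines(news), sampleLines(twitter))
inTrain      <- createDataPartition(seq_along(merged), p = 0.7, list = FALSE)
trainingData <- merged[inTrain]
testData     <- merged[-inTrain]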
So the merged selection has 42,704 lines; the training data set has 29,896 lines, while the test data set has 12,808 (these are the three values printed below).
## [1] 42704
## [1] 29896
## [1] 12808
Then, from the training data, I created a corpus that I cleaned up with various transformations, applied in the following order: put all words in lower case, remove swear words, remove punctuation and numbers as well as "strange" characters (⁰, ½ and ¾), and remove words that contain three identical characters in a row, like aaahh or treeend. Finally I removed redundant white space (which is more a cosmetic step than really useful for our purpose).
Removing bad words might be subject to controversy, since some people intentionally use them and would like the prediction algorithm to predict them. On the other hand, keeping them might lead to unintentional use of bad words that end up in an SMS to one's boss or a colleague. I used two sources of bad words, bannedwordlist.com and Google's list, which I combined into a single list.
I decided not to remove stopwords and not to stem the corpus, since these two techniques would prevent the algorithm from predicting simple words like "the", or "initially" (which would have been stemmed to "initial").
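As a minimal sketch of this cleaning pipeline (the badWords vector and the regular expressions are assumptions, not the exact transformations used):
trainingCorp  <- VCorpus(VectorSource(trainingData))                 # one document per line
removePattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))
trainingCorp  <- tm_map(trainingCorp, content_transformer(tolower))
trainingCorp  <- tm_map(trainingCorp, removeWords, badWords)         # combined bad-word list
trainingCorp  <- tm_map(trainingCorp, removePunctuation)
trainingCorp  <- tm_map(trainingCorp, removeNumbers)
trainingCorp  <- tm_map(trainingCorp, removePattern, "[\u2070\u00bd\u00be]")     # "strange" characters
trainingCorp  <- tm_map(trainingCorp, removePattern, "\\S*(\\S)\\1{2,}\\S*")     # words with tripled characters
trainingCorp  <- tm_map(trainingCorp, stripWhitespace)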
At this point we can create the document-term matrix, a big matrix showing which word is present in which document. In our case a document is a single line of text, so individual words will generally be present in only a few documents (= lines), since those lines are relatively short. Hence the sparsity is essentially 100%. Below we can see the first 5 terms of that matrix (in alphabetical order) for the first 6 documents; in total there are 45,652 terms in this cleaned-up corpus.
dtmTrainingCorp <- DocumentTermMatrix(trainingCorp) # sparsity of 100% !
dtmTrainingCorp$ncol # 45652 terms
## [1] 45652
inspect(dtmTrainingCorp[1:6,1:5])
## <<DocumentTermMatrix (documents: 6, terms: 5)>>
## Non-/sparse entries: 0/30
## Sparsity : 100%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aaba aactive aacu aaliyah aam
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
Let’s have a look at this DocumentTermMatrix after removing the one-letter words (although for true prediction based on n-grams it will be necessary to keep them in the next step of this study), as well as words longer than 20 characters (which are most likely mistypings where a space was omitted). In addition I discarded words that appear in fewer than 12 documents (= lines, in our case).
dtmTrainingCorpControl2L <- DocumentTermMatrix(trainingCorp,
control=list(wordLengths=c(2, 20),
bounds = list(global = c(12,Inf))))
freqCount2L <- colSums(as.matrix( dtmTrainingCorpControl2L))
sorted2L <- head(sort(freqCount2L, decreasing = TRUE), 40)
sorted2L
## the to and of in for is that you it on with
## 33680 19594 16990 14004 11603 7588 7538 7198 6455 6327 5783 4988
## was my at be this have are as but we he not
## 4369 4307 4034 3975 3922 3649 3533 3465 3396 3007 2977 2847
## from so me its all by will they said about or your
## 2814 2680 2609 2577 2281 2271 2246 2222 2186 2134 2132 2132
## up an his just
## 2110 2108 2095 2045
By summing up the occurrences of each word and then sorting, we can see above which 40 words of at least two letters appear most often in this training corpus.
Let’s have a look at the most frequent bigrams of the original training data set (now including the one-letter words again).
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtmTrainingCorpControlBigram <- DocumentTermMatrix(trainingCorp,
control = list(tokenize = BigramTokenizer)) # 328775
dtmTrainingCorpControlBigram2 <- removeSparseTerms(dtmTrainingCorpControlBigram,
0.99995)
inspect(dtmTrainingCorpControlBigram2) # 32414 with 0.9999; 61776 with 0.99995
## <<DocumentTermMatrix (documents: 29896, terms: 61776)>>
## Non-/sparse entries: 388732/1846466564
## Sparsity : 100%
## Maximal term length: 27
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs and the at the for the in a in the of the on the to be to the
## 2073 2 2 1 1 1 5 0 1 0
## 2507 0 2 1 0 5 2 0 0 1
## 2806 1 3 1 0 3 11 3 1 4
## 3751 0 1 3 0 1 2 0 0 1
## 3958 0 0 1 0 3 0 1 0 0
## 4797 1 0 0 0 2 0 0 0 1
## 4895 2 0 0 0 4 2 1 1 0
## 5917 2 0 0 0 2 3 1 0 1
## 5996 0 1 1 1 1 1 0 1 1
## 6034 0 0 0 0 1 1 1 2 1
## Terms
## Docs with the
## 2073 2
## 2507 0
## 2806 1
## 3751 1
## 3958 0
## 4797 0
## 4895 0
## 5917 1
## 5996 0
## 6034 0
freqCountBigram2 <- colSums(as.matrix( dtmTrainingCorpControlBigram2))
sortedBigram2 <- head(sort(freqCountBigram2, decreasing = TRUE), 40)
By removing the sparse terms, we reduce the size of the bigram matrix from 328,775 terms down to 61,776 terms. Here are the 40 most frequent, shown as a diagram:
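As a minimal sketch of how such a bar chart can be produced from sortedBigram2 (the styling is an assumption, not necessarily that of the original diagram):
bigramFreq <- data.frame(bigram = names(sortedBigram2), count = as.numeric(sortedBigram2))
ggplot(bigramFreq, aes(x = reorder(bigram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "bigram", y = "frequency in the training corpus")
The same approach applies to the unigram counts (sorted2L) and to the trigram counts computed further below (sortedTrigram2).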
Let’s have a look at the most frequent trigrams after removing the most sparse trigrams:
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtmTrainingCorpTrigram <- DocumentTermMatrix(trainingCorp,
control = list(tokenize = TrigramTokenizer))
dtmTrainingCorpTrigram2 <- removeSparseTerms(dtmTrainingCorpTrigram, 0.9999665)
inspect(dtmTrainingCorpTrigram2) # 35375 with 0.99995 and 0.9999665; but back to 550074 with 0.9999666
## <<DocumentTermMatrix (documents: 29896, terms: 35375)>>
## Non-/sparse entries: 121085/1057449915
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs a lot of going to be i want to it was a looking forward to
## 175 0 0 0 0 0
## 2388 0 0 1 0 0
## 2507 0 0 0 0 0
## 2806 0 0 0 1 0
## 2978 0 2 0 0 0
## 3751 0 0 0 0 0
## 4797 0 0 0 1 0
## 4895 1 0 0 1 0
## 5917 0 0 0 0 0
## 5996 0 0 1 0 0
## Terms
## Docs one of the out of the thanks for the the end of to be a
## 175 0 1 0 0 0
## 2388 1 0 0 0 0
## 2507 0 0 0 0 0
## 2806 0 1 0 0 1
## 2978 0 0 0 1 0
## 3751 0 1 0 0 0
## 4797 0 0 0 0 0
## 4895 0 0 0 0 0
## 5917 1 0 0 0 0
## 5996 0 0 0 1 0
freqCountTrigram2 <- colSums(as.matrix( dtmTrainingCorpTrigram2))
sortedTrigram2 <- head(sort(freqCountTrigram2, decreasing = TRUE), 40)
By removing the sparse trigrams, we reduce the size of the trigram matrix from 550,074 trigrams down to 35,375 trigrams. Here are the 40 most frequent, shown as a diagram: