In this report, we examine the basic features of the English (en_US) data from HC Corpora, as a first step toward developing a predictive text algorithm.
The uncompressed en_US data, downloaded from the link provided by SwiftKey, takes up less than 560MB. After some experimentation with the quanteda R package, my first finding is that any 64-bit machine with 4-8GB of RAM can work with this data without any “lazy loading” or “load on demand” techniques. This simplifies things a bit.
Our data consists of three files of roughly similar size (150-200MB each): blogs.txt, news.txt and twitter.txt (the en_US prefix is dropped for brevity). The quantitative analysis below summarizes each of them.
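As a quick cross-check, the per-file sizes and line counts can also be obtained directly from R; the minimal base-R sketch below assumes the same final/en_US/ layout used for the corpora later on (note that readLines() loads each file fully into memory, which is fine given the RAM finding above).
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
file.info(files)$size / 2^20                                          # size of each file in MB
sapply(files, function(f) length(readLines(f, encoding = "UTF-8")))   # line count per file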
While file sizes and line counts were checked with standard UNIX utilities, the remaining metrics come from R’s text mining framework, so let’s create corpora from our three files:
library(quanteda)
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:stats':
##
## df
## The following object is masked from 'package:base':
##
## sample
en_blogs <- corpus(textfile("final/en_US/en_US.blogs.txt", cache = FALSE))
en_news <- corpus(textfile("final/en_US/en_US.news.txt", cache = FALSE))
en_twitter <- corpus(textfile("final/en_US/en_US.twitter.txt", cache = FALSE))
summary(en_blogs, 5)
## Corpus consisting of 1 document.
##
## Text Types Tokens Sentences
## text1 460516 43938441 2350680
##
## Source: C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:26 2016
## Notes:
summary(en_news, 5)
## Corpus consisting of 1 document.
##
## Text Types Tokens Sentences
## text1 104412 3164222 153289
##
## Source: C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:27 2016
## Notes:
summary(en_twitter, 5)
## Corpus consisting of 1 document.
##
## Text Types Tokens Sentences
## text1 536950 37035421 3761131
##
## Source: C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:42 2016
## Notes:
To perform any word/phrase analysis of textual data, we need to clean it: remove numbers, leading/trailing whitespace and punctuation, convert everything to lower case, and drop known stopwords (like “me/you/they”, “do/does” etc.). In quanteda this can be done during document-feature matrix creation (see the arguments of the dfm() method below).
For the real predictive algorithm we will also have to filter profanity somehow; at the current stage, however, this is not a problem, and all results below may include every word used by the authors of our texts (a possible filtering approach is sketched after the dfm creation below).
en_blogs_dfm <- dfm(en_blogs, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE, removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1 document
## ... indexing features: 432,604 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... created a 1 x 432430 sparse dfm
## ... complete.
## Elapsed time: 95.94 seconds.
en_news_dfm <- dfm(en_news, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE, removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1 document
## ... indexing features: 95,954 feature types
## ... removed 172 features, from 174 supplied (glob) feature types
## ... created a 1 x 95782 sparse dfm
## ... complete.
## Elapsed time: 6.4 seconds.
en_twitter_dfm <- dfm(en_twitter, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE, removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 1 document
## ... indexing features: 439,262 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... created a 1 x 439088 sparse dfm
## ... complete.
## Elapsed time: 45.01 seconds.
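Returning to the profanity note above: one possible approach is to extend the same ignoredFeatures argument with a profanity word list. This is a sketch only; profanity.txt is a hypothetical file with one word per line, not part of this project yet.
profanity <- readLines("profanity.txt", encoding = "UTF-8")  # hypothetical word list, one word per line
en_blogs_dfm_clean <- dfm(en_blogs, toLower = TRUE, removeSeparators = TRUE,
                          removePunct = TRUE, removeNumbers = TRUE,
                          ignoredFeatures = c(stopwords("english"), profanity))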
The blogs file contains about 900k lines, the news file about 1010k, and the twitter file about 2360k. So we can assume that both blogs and news consist of complex sentences, while the twitter data has simple ones built from 2-3 phrases. This finding is in line with the obvious limit on each twitter message (140 characters).
From the summaries of the blogs/news/twitter corpora above, we see that the news corpus has only 153k sentences and about 3m tokens, roughly 10x less than blogs or twitter. Its sentences are, however, comparable in length to blog sentences and roughly twice as long as tweets, and its text reads as literate prose with few misprints.
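The sentence-length comparison follows from simple arithmetic on the corpus summaries shown earlier (average tokens per sentence):
43938441 / 2350680   # blogs:   ~18.7 tokens per sentence
3164222  / 153289    # news:    ~20.6 tokens per sentence
37035421 / 3761131   # twitter: ~9.8 tokens per sentence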
Let’s compute the total word count (i.e. number of tokens) of our cleaned document-feature matrices:
ntoken(en_blogs_dfm)
## text1
## 19768891
ntoken(en_news_dfm)
## text1
## 1513040
ntoken(en_twitter_dfm)
## text1
## 17058501
As we can see, more than 50% of the words in the initial corpora were filtered out by the cleaning procedures. Now let’s compute the number of unique words (i.e. feature types) in each dfm:
ntype(en_blogs_dfm)
## text1
## 432430
ntype(en_news_dfm)
## text1
## 95782
ntype(en_twitter_dfm)
## text1
## 439088
While blogs/twitter have about 10x more tokens (i.e. total word count) than news, they have only about 4x more features (i.e. unique words). Blogs and twitter also probably contain many mistakes and misprints in words and punctuation (omitted spaces between words etc.), so the real feature counts of these corpora are expected to be even closer to news.
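One way to make this comparison explicit is the type/token ratio of each dfm, computed from the counts above (a quick sketch; a lower ratio means more repetition of the same vocabulary):
ntype(en_blogs_dfm)   / ntoken(en_blogs_dfm)    # ~0.022
ntype(en_news_dfm)    / ntoken(en_news_dfm)     # ~0.063
ntype(en_twitter_dfm) / ntoken(en_twitter_dfm)  # ~0.026
Now let’s look at the 20 most frequent features in each corpus: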
topfeatures(en_blogs_dfm, 20)
## s one will just like can t вђ time get
## 170615 124172 112517 100195 98404 98010 95631 88755 87936 70622
## know now people also iвђ new even first make back
## 59925 59412 58943 55283 54779 54204 51995 50710 50541 50314
topfeatures(en_news_dfm, 20)
## said will one new вђ s also two can year
## 19169 8467 6400 5329 4922 4850 4515 4436 4395 4229
## first just last time state like people years get city
## 4151 4136 4025 3977 3802 3771 3646 3643 3367 2823
topfeatures(en_twitter_dfm, 20)
## just like get love good will day can thanks rt
## 150987 121981 112291 106036 100639 94658 89899 89680 89462 88750
## now one know u great time today go lol new
## 83587 81900 79766 77071 75955 75453 72715 72343 69623 69605
As we can see, all three corpora appear to have a similar structure in terms of basic word frequencies, so for 1/2/3-gram analysis we might be able to merge them into one corpus (this idea is to be validated later). Another important observation from the top-features lists is that one-symbol and special-symbol terms (such as the “вђ” encoding artifact above) probably need to be cleaned as well.
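A minimal sketch of how such terms could be located for later cleaning, using only topfeatures() and base R (the 1000-feature cutoff is an arbitrary choice for illustration):
blogs_top  <- topfeatures(en_blogs_dfm, 1000)
suspicious <- names(blogs_top)[nchar(names(blogs_top)) == 1 |
                               grepl("[^a-z']", names(blogs_top))]
head(suspicious, 20)   # candidates: single characters and non a-z terms
Next, let’s look at the full frequency distributions and find the rank at which word frequency drops to 1: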
blogs_features <- topfeatures(en_blogs_dfm, 500000)
news_features <- topfeatures(en_news_dfm, 100000)
twitter_features <- topfeatures(en_twitter_dfm, 500000)
plot(blogs_features, log="x")    # word frequency vs. rank, log-scaled x axis
plot(news_features, log="x")
plot(twitter_features, log="x")
which(blogs_features %in% c(1))[[1]]    # rank of the first word that occurs only once
## [1] 191226
which(news_features %in% c(1))[[1]]
## [1] 47645
which(twitter_features %in% c(1))[[1]]
## [1] 165618
While we won’t discuss any deep or comprehensive results today, such as model decisions or the selection of building blocks for the predictive algorithm, I can conclude from the results above that these three files can serve as the base for a single corpus without differentiating between them: all three can be processed by the same logic and used in the same way for prediction of phrase entry. This of course needs to be checked against a concrete algorithm.
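If that holds, the merge itself is straightforward (a sketch, assuming this quanteda version supports the + operator for corpus objects):
en_all <- en_blogs + en_news + en_twitter   # combine the three corpora into one
summary(en_all, 5)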
Another key finding concerns a word-frequency threshold: based on our metrics, it looks like we can drop all words that occur fewer than 2-10 times in our collection of texts. Such filtering should help a lot in developing the predictive algorithm (it would need to recognize only about 10k words!). The exact cutoff of course needs to be validated; also, a unique word with only a few occurrences in the corpus can be “supported” by a strong n-gram, in which case that word should still be recognized within the bounds of such an n-gram.
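A quick sketch of what such a threshold does, using the full frequency vectors computed above (the cutoff of 5 is an arbitrary example within the 2-10 range mentioned):
blogs_kept <- blogs_features[blogs_features >= 5]   # drop words seen fewer than 5 times
length(blogs_kept)                                  # vocabulary size after thresholding
sum(blogs_kept) / sum(blogs_features)               # share of total tokens still covered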
As for n-gram analysis: it looks like we will have to take a 5-10% sample of the input data, otherwise 48-64GB of RAM would be required to build 2/3/4-grams on the full blogs/news/twitter corpus. I will analyse n-grams together with appropriate skipgrams (checking 2-grams on top of 1-grams, 3-grams on top of 1/2-grams, and so on) to find a reasonable amount of training data for predicting user input, including input with mistakes, well enough.
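A minimal sketch of the sampling step (the 10% rate and the random seed are arbitrary choices; whether dfm() accepts the ngrams argument should be checked against the installed quanteda version):
set.seed(20160320)                                               # for reproducibility
blogs_lines  <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
blogs_sample <- sample(blogs_lines, length(blogs_lines) %/% 10)  # ~10% of the lines
en_blogs_sample <- corpus(blogs_sample)
# bigrams on the sample (assumes this quanteda version passes ngrams through to its tokenizer)
en_blogs_2gram_dfm <- dfm(en_blogs_sample, ngrams = 2)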