Introduction

In this report, we’re going to examine the basic features of the English (en_US) data from HC Corpora, as a first step towards developing a predictive text algorithm.

Getting & cleaning the data

  1. The uncompressed en_US data, downloaded from the URL provided by SwiftKey, is less than 560MB, so after some experimenting with the quanteda R package, my first finding is that any 64-bit machine with 4-8GB of RAM can handle this data without any “lazy loading” or “load on demand” techniques. This finding simplifies things a bit.

  2. Our data consists of three files of relatively similar size (each 150-200MB): blogs.txt, twitter.txt and news.txt (the en_US prefix is dropped for brevity). Further quantitative analysis will give a summary of each of them.

  3. File sizes and line counts were checked with standard UNIX utilities, while the other metrics come from R’s text mining framework (a small R cross-check is sketched after the next chunk). So let’s create corpora from our three files:

library(quanteda)
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:stats':
## 
##     df
## The following object is masked from 'package:base':
## 
##     sample
en_blogs <- corpus(textfile("final/en_US/en_US.blogs.txt", cache = FALSE))
en_news <- corpus(textfile("final/en_US/en_US.news.txt", cache = FALSE))
en_twitter <- corpus(textfile("final/en_US/en_US.twitter.txt", cache = FALSE))
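For reference, here is a minimal sketch of the size/line-count checks mentioned above, done from R instead of the UNIX shell (reading whole files with readLines() just to count lines is slow, but fine for a one-off check):

# Cross-check of the shell results: file sizes in MB and line counts
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
round(file.size(files) / 1024^2)                 # sizes in MB
sapply(files, function(f) length(readLines(f)))  # line counts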
  1. Let’s check what we loaded:
summary(en_blogs, 5)
## Corpus consisting of 1 document.
## 
##   Text  Types   Tokens Sentences
##  text1 460516 43938441   2350680
## 
## Source:  C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:26 2016
## Notes:
summary(en_news, 5)
## Corpus consisting of 1 document.
## 
##   Text  Types  Tokens Sentences
##  text1 104412 3164222    153289
## 
## Source:  C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:27 2016
## Notes:
summary(en_twitter, 5)
## Corpus consisting of 1 document.
## 
##   Text  Types   Tokens Sentences
##  text1 536950 37035421   3761131
## 
## Source:  C:/cygwin64/home/Дмитрий/Capstone.project/* on x86-64 by Дмитрий
## Created: Sun Mar 20 22:45:42 2016
## Notes:
  1. To perform any word/phrase analysis of textual data, we need to clean it: remove numbers, leading/trailing whitespace and punctuation, convert everything to lower case, and drop known stopwords (like “me/you/they”, “do/does”, etc.). In quanteda this can be done while creating the document-feature matrix (check the arguments of the dfm() calls below).

  2. For the real predictive algorithm, we’ll also have to filter out profanity somehow; at the current stage, however, this isn’t a problem, and all our results may include every word used by the authors of these texts (a possible filtering approach is sketched after the dfm chunks below).

en_blogs_dfm <- dfm(en_blogs, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE, removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1 document
##    ... indexing features: 432,604 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... created a 1 x 432430 sparse dfm
##    ... complete. 
## Elapsed time: 95.94 seconds.
en_news_dfm <- dfm(en_news, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE, removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1 document
##    ... indexing features: 95,954 feature types
##    ... removed 172 features, from 174 supplied (glob) feature types
##    ... created a 1 x 95782 sparse dfm
##    ... complete. 
## Elapsed time: 6.4 seconds.
en_twitter_dfm <- dfm(en_twitter, toLower = TRUE, removeSeparators = TRUE, removePunct = TRUE,  removeNumbers= TRUE, ignoredFeatures = stopwords("english"))
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 1 document
##    ... indexing features: 439,262 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... created a 1 x 439088 sparse dfm
##    ... complete. 
## Elapsed time: 45.01 seconds.
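Regarding the profanity filtering from item 2 above, here is a minimal sketch of how it could be plugged into the same dfm() calls later; the word-list file name is hypothetical:

# Hypothetical profanity list, one term per line, appended to the stopword list
profanity <- readLines("profanity_en.txt", encoding = "UTF-8")
en_blogs_dfm_filtered <- dfm(en_blogs, toLower = TRUE, removeSeparators = TRUE,
                             removePunct = TRUE, removeNumbers = TRUE,
                             ignoredFeatures = c(stopwords("english"), profanity))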

Computational metrics

  1. The blogs file contains about 900k lines, twitter about 2360k lines, and news about 1010k. Combined with the token counts above, this suggests that blogs and news consist of complex sentences, while twitter texts are simple ones built from 2-3 short phrases. This finding is in line with the obvious 140-character limit on each twitter message.

  2. From the summaries of the blogs/news/twitter corpora above, we see that the news corpus has only 153k sentences and only about 3M tokens, roughly 10x fewer than blogs/twitter. This suggests that news entries are much longer, multi-sentence pieces of edited, literary text with few misprints.

  3. Let’s compute the total word count (i.e. the number of tokens) of our corpora:

ntoken(en_blogs_dfm)
##    text1 
## 19768891
ntoken(en_news_dfm)
##   text1 
## 1513040
ntoken(en_twitter_dfm)
##    text1 
## 17058501

As we can see, more than 50% of the tokens in the initial corpora were filtered out by the cleaning procedures.
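The exact shares can be computed from the counts above:

# Share of tokens removed by cleaning: 1 - (dfm tokens) / (raw corpus tokens)
round(1 - c(blogs   = 19768891 / 43938441,
            news    =  1513040 /  3164222,
            twitter = 17058501 / 37035421), 2)
# roughly 0.55, 0.52 and 0.54 respectively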

  1. Let’s also check the unique word count (i.e. the number of features) for each part of the corpus:
ntype(en_blogs_dfm)
##  text1 
## 432430
ntype(en_news_dfm)
## text1 
## 95782
ntype(en_twitter_dfm)
##  text1 
## 439088

While blogs/twitter have more than 10x the tokens (i.e. total word count) of news, they have only about 4.5x the features (i.e. unique words). Both blogs and twitter probably contain many mistakes and misprints in words and punctuation (omitted spaces between words, etc.), so their real feature counts are expected to be even closer to news.
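The ratios behind this observation, taken directly from the counts above:

# Token and unique-word (type) ratios of blogs/twitter relative to news
round(c(blogs_tokens   = 19768891 / 1513040,
        twitter_tokens = 17058501 / 1513040,
        blogs_types    =   432430 /   95782,
        twitter_types  =   439088 /   95782), 1)
# roughly 13x / 11x for tokens, but only 4.5x / 4.6x for types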

  1. Let’s check the top 20 features of these corpora:
topfeatures(en_blogs_dfm, 20)
##      s    one   will   just   like    can      t     вђ   time    get 
## 170615 124172 112517 100195  98404  98010  95631  88755  87936  70622 
##   know    now people   also    iвђ    new   even  first   make   back 
##  59925  59412  58943  55283  54779  54204  51995  50710  50541  50314
topfeatures(en_news_dfm, 20)
##   said   will    one    new     вђ      s   also    two    can   year 
##  19169   8467   6400   5329   4922   4850   4515   4436   4395   4229 
##  first   just   last   time  state   like people  years    get   city 
##   4151   4136   4025   3977   3802   3771   3646   3643   3367   2823
topfeatures(en_twitter_dfm, 20)
##   just   like    get   love   good   will    day    can thanks     rt 
## 150987 121981 112291 106036 100639  94658  89899  89680  89462  88750 
##    now    one   know      u  great   time  today     go    lol    new 
##  83587  81900  79766  77071  75955  75453  72715  72343  69623  69605

As we can see, all three corpora appear to have a similar structure in terms of basic word frequencies, so for 1/2/3-gram analysis we might be OK to merge them into one corpus (this idea is to be validated further on). Another important takeaway from the top-features lists is that one-symbol and special-symbol terms should probably be cleaned out as well.
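One possible way to do that extra clean-up, sketched here with plain column subsetting of the dfm (the exact rules would need refinement, and the same step would apply to the news/twitter dfms):

# Drop single-character features and features containing anything other than
# lowercase letters or apostrophes (this also catches the "вђ"-style encoding artifacts)
feats <- colnames(en_blogs_dfm)
bad   <- nchar(feats) == 1 | grepl("[^a-z']", feats)
en_blogs_dfm_ascii <- en_blogs_dfm[, !bad]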

Visualizing the features

  1. The first interesting plot is a word frequency histogram. The point is that absolutely unique words are most likely misprints or very rare words, so we have practically no chance of predicting them anyway. Another interesting question is at which frequency we should cut such words off: we need to find a reasonably popular set of words that covers most of our corpus (a cumulative-coverage sketch is given at the end of this section):
blogs_features <- topfeatures(en_blogs_dfm, 500000)
news_features <- topfeatures(en_news_dfm, 100000)
twitter_features <- topfeatures(en_twitter_dfm, 500000)

plot(blogs_features, log="x")

plot(news_features, log="x")

plot(twitter_features, log="x")

  1. As we can see, most words have almost zero frequency. Let’s find the rank at which features with only one occurrence begin (such features are almost certainly mistakes or extremely rare words and can be dropped):
which(blogs_features %in% c(1))[[1]]
## [1] 191226
which(news_features %in% c(1))[[1]]
## [1] 47645
which(twitter_features %in% c(1))[[1]]
## [1] 165618
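To answer the cut-off question from above more directly, a cumulative-coverage check on the ranked frequencies shows how many of the top words cover a given share of all cleaned tokens; a sketch (not evaluated here) for the blogs corpus:

# How many of the most frequent words cover 50% / 90% of the cleaned blogs tokens
blogs_coverage <- cumsum(blogs_features) / sum(blogs_features)
c(words_for_50pct = which(blogs_coverage >= 0.5)[1],
  words_for_90pct = which(blogs_coverage >= 0.9)[1])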

Some results

  1. While we won’t discuss any deep and comprehensive results today, such as model decisions or the choice of building blocks for constructing our predictive algorithm, I can guess from the results above that we can use these three files as the basis of a single corpus without differentiating between them, so all three files could be processed by the same logic and used in the same way for predicting phrase entry. This of course needs to be checked against a concrete algorithm.

  2. Another key finding concerns the word frequency threshold: based on our metrics, it looks like we’re OK to drop all words that occur fewer than 2-10 times in our collection of texts. Such filtering should help a lot in developing the predictive algorithm (it would only need to recognize about 10k words!). Of course this needs to be checked for a specific threshold, and some unique words with only a few occurrences in the corpus can be “supported” by a strong n-gram, in which case such a word should be recognized within the bounds of that n-gram (a rough sketch follows below).

  3. As for n-gram analysis, it looks like we’ll have to take a 5-10% sample of the input data, otherwise 48-64GB of RAM would be required to build 2/3/4-grams over the full blogs/news/twitter corpus. I’ll analyze n-grams together with appropriate skipgrams (checking 2-grams on top of 1-grams, 3-grams on top of 1/2-grams, and so on) to find a reasonable amount of training data for predicting user input, including typos, well enough.
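To make points 2 and 3 concrete, here is a rough sketch; the minimum count of 5, the 10% sample fraction and the use of the ngrams argument of dfm() are assumptions to be validated:

# (2) Drop features seen fewer than 5 times, via plain column subsetting of the dfm
en_blogs_dfm_trimmed <- en_blogs_dfm[, colSums(en_blogs_dfm) >= 5]

# (3) Build a 2-gram dfm from a ~10% sample of the blogs lines
blogs_lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
set.seed(20160320)
blogs_sample <- blogs_lines[runif(length(blogs_lines)) < 0.10]
blogs_2gram_dfm <- dfm(corpus(blogs_sample), ngrams = 2, toLower = TRUE,
                       removeSeparators = TRUE, removePunct = TRUE,
                       removeNumbers = TRUE)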