project by: aiooo
This report describes the preliminary stage of building a prediction model for a keyboard that simplifies typing on mobile devices. It includes a general description of the dataset, a short account of sampling and preprocessing, and an exploratory analysis. The results highlight n-gram frequencies and their distribution. The main conclusions underline the low term sparsity and the relatively high percentage of words recognized as foreign.
The dataset consists of three text files (.txt) with publicly available American English texts sourced from blogs (en_US.blogs.txt), news (en_US.news.txt) and Twitter posts (en_US.twitter.txt). The original data source can be found here.
According to the prompt info, the whole corpus is 583 MB in size: 200 MB of blog data, 196 MB of news data and 159 MB of Twitter data.
The blog data consists of 899,288 lines of text, the news data of 1,010,242 lines and the Twitter data of 2,360,148 lines.
Due to the large size of the data, the exploratory analysis has been performed on 10,000-line samples drawn from the blogs and news files and a 50,000-line sample drawn from the Twitter file.
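A minimal sketch of the sampling step, assuming the raw files sit in the working directory (the seed value and object names are illustrative, not the report's exact code):
set.seed(1234)                                                    # illustrative seed, for reproducibility
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogsSample   <- sample(blogs,   10000)                           # 10,000-line samples for blogs and news
newsSample    <- sample(news,    10000)
twitterSample <- sample(twitter, 50000)                           # 50,000-line sample for Twitter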
For the purpose of the analysis, the following packages have been loaded:
library(tm)        # text mining: corpora, cleaning transformations, term-document matrices
library(RWeka)     # NGramTokenizer for n-gram extraction
library(wordcloud) # wordcloud plots of term frequencies
library(textcat)   # n-gram-based language identification
For the purpose of data cleaning, the following operations have been performed using both the tm package and regular expressions (regex):
Coordinating conjunctions and articles have been removed, based on the assumption that in most sentences they serve as a "start point" and are naturally placed at the beginning of a sentence or phrase. Though this reduces the chances of predicting many idioms and expressions (such as 'bread and butter' or 'such a fool'), it still seems to be the better solution. Intra-word dashes and apostrophes were preserved, while double and triple dashes and apostrophes were removed (with regex). Profanity filtering has been based on this dictionary.
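A minimal sketch of this cleaning pipeline with tm, assuming the samples have been combined into a corpus (the stop-word subset and object names are illustrative):
corpus <- VCorpus(VectorSource(c(blogsSample, newsSample, twitterSample)))
corpus <- tm_map(corpus, content_transformer(tolower))
# illustrative subset of coordinating conjunctions and articles
corpus <- tm_map(corpus, removeWords, c("and", "but", "or", "nor", "for", "so", "yet", "a", "an", "the"))
# collapse runs of two or more dashes/apostrophes, leaving single intra-word ones intact
squashRuns <- content_transformer(function(x) gsub("-{2,}|'{2,}", " ", x))
corpus <- tm_map(corpus, squashRuns)
corpus <- tm_map(corpus, removeWords, profanityList)  # profanityList: words read from the dictionary above (loading code omitted)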
The preprocessing included tokenization of sentences and phrases, with dots and commas serving as delimiters. Tokenization changed the number of lines in the samples: to 46,025 for blogs, 42,946 for news and 91,374 for Twitter.
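A sketch of this tokenization step (the helper below is a plausible reconstruction of splitting at dots and commas, not the report's exact code):
splitPhrases <- function(lines) {
  tokens <- unlist(strsplit(lines, "[.,]+"))   # dots and commas as delimiters
  tokens <- trimws(tokens)
  tokens[nchar(tokens) > 0]                    # drop empty fragments
}
blogsTok <- splitPhrases(blogsSample)          # 46,025 phrases for the blog sample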
It is important to evaluate the proportion of sparse terms in the text in order to appropriately 'thin' the word corpus, which is necessary to save memory and computation time. To find the level of sparsity, the samples were converted into a corpus sourced from a single directory (as elements of a single list). The vocabulary sparsity was measured after transformation into a term-document matrix with the tm package.
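A sketch of this step with tm (the binary weighting matches the output below; the 0.5 threshold used for the later 'skimming' is an assumption):
corpus <- VCorpus(DirSource("samples"))        # the three sample files in one directory
tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightBin))
tdm                                            # prints the summary shown below
dim(tdm)
tdmSkimmed <- removeSparseTerms(tdm, 0.5)      # drop terms missing from more than half the documents
dim(tdmSkimmed)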
## <<TermDocumentMatrix (terms: 66142, documents: 3)>>
## Non-/sparse entries: 99893/98533
## Sparsity : 50%
## Maximal term length: 88
## Weighting : binary (bin)
## [1] 66142 3
## [1] 22206 3
Sparsity appeared to be quite low (50%). Further data 'skimming' (removal of the sparsest terms) reduced the vocabulary from 66,142 to 22,206 terms.
Another task of the exploratory analysis for language processing is to find the most frequent n-grams (unigrams, bigrams and trigrams, i.e. single words and two- and three-word clusters) occurring in the corpus. Recognizing these n-grams is the basis for building the future prediction algorithm. This can be done with the RWeka package, performing n-gram tokenization. The results below show the 20 most frequent unigrams, bigrams and trigrams:
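A minimal sketch of the tokenization, shown for bigrams (setting min/max to 1 or 3 gives unigrams and trigrams; the input object is illustrative):
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ngramTok <- bigramTokenizer(paste(blogsTok, collapse = " "))
head(sort(table(ngramTok), decreasing = TRUE), 20)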
## ngramTok
##    to     i    of    in   you    it    is  that     s    on    my  with
## 37023 31218 24603 21198 17246 16227 14998 14740 14682 11738  9520  9228
##    t  was   at   be this have   we  are
## 7902 7879 7827 7740 7436 7405 7037 6723
## ngramTok
##  i m it s don t to be if you that s can t it was
## 3537 3320  2492  2313   1324   1306  1273   1241
## going to i have i was i am will be you re it is i can
##     1237   1236  1222 1166    1116   1113  1095  1075
## to get i love have to want to
##   1011    978     931     921
## ngramTok
## i don t i can t can t wait
##     818     430        348
## going to be i m not i didn t
##         301     301      292
## it s not you don t don t know
##      266       248        233
## i want to i ve been i love you
##       220       218        197
## don t have i have to looking forward to
##        187       179                175
## i m going is going to t wait to
##       171         171       170
## if you re i need to
##       168       164
The sorted n-grams are presented in the form of wordclouds.
The wordcloud for unigrams (all words occurring at least 750 times):
The wordcloud for bigrams (all bigrams occurring at least 500 times):
The wordcloud for trigrams (all trigrams occurring at least 150 times):
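A sketch of how such a cloud can be drawn with the wordcloud package (the 750 threshold matches the unigram cloud above; the colour and ordering settings are assumptions):
freq <- sort(table(ngramTok), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), min.freq = 750,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))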
The lack of intra-word apostrophes (visible in forms like 'don t' and 'i m' above) is obviously the weak point of this analysis, hopefully to be sorted out in the final report.
Recognition of non-English words can be performed with the textcat package, which provides tools for text categorization based on n-grams. Unfortunately, the analysis undertaken with the package turned out to be very slow, and the results were far from perfect for the 1-, 2- and 3-gram analyses, though they clearly improved with the size of the n-grams. Due to the low efficiency of the package, the language analysis has been performed on a 5,000-line sample of the data.
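A minimal sketch of the check, assuming the tokens are held in a character vector (summary() of the logical comparison yields the tables below):
ngramLang <- textcat(ngramTok)      # one language label per n-gram; slow for large inputs
summary(ngramLang == "english")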
The number of n-grams recognized as English (= TRUE):
unigram analysis
##    Mode   FALSE    TRUE    NA's
## logical   28078    6205       0
trigram analysis
##    Mode   FALSE    TRUE    NA's
## logical   13382   11736       0
mixed (unigram, bigram and trigram) analysis
##    Mode   FALSE    TRUE    NA's
## logical   58770   27582       0