project by: aiooo
This report describes the preliminary stage of building a prediction model for a keyboard that simplifies typing on mobile devices. It includes a general description of the dataset, a short account of sampling and preprocessing, and an exploratory analysis. The results highlight n-gram frequencies and their distribution. The main conclusions underline the low term sparsity and the relatively high percentage of words recognized as foreign.
The dataset consists of three text files (.txt) with publicly available American English texts sourced from blogs (en_US.blogs.txt), news (en_US.news.txt) and Twitter posts (en_US.twitter.txt). The original data source can be found here.
According to the prompt info, the whole corpus is 583 MB in size: 200 MB of blog data, 196 MB of news data and 159 MB of Twitter data.
The blog data consists of 899,288 lines of text, the news data of 1,010,242 lines and the Twitter data of 2,360,148 lines.
Due to the large size of the data, the exploratory analysis has been performed on 10,000-line samples drawn from the blogs and news files and a 50,000-line sample drawn from the Twitter file.
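A minimal sketch of the sampling step, assuming the raw files sit in the working directory (the seed value and object names are illustrative, not the report's exact code):
set.seed(1234)                                                    # illustrative seed, for reproducibility
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogsSample   <- sample(blogs,   10000)                           # 10,000-line samples for blogs and news
newsSample    <- sample(news,    10000)
twitterSample <- sample(twitter, 50000)                           # 50,000-line sample for Twitter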
For the purpose of the analysis, the following packages have been loaded:
library(tm)        # text mining: corpora, cleaning transformations, term-document matrices
library(RWeka)     # NGramTokenizer for n-gram extraction
library(wordcloud) # wordcloud plots of term frequencies
library(textcat)   # n-gram-based language identification
For the purpose of data cleaning, the following operations have been performed using both the tm package and regular expressions (regex):
Coordinating conjunctions and articles have been removed, based on the assumption that in most sentences they serve as a "start point" and are naturally placed at the beginning of a sentence or phrase. Though this reduces the chances of predicting many idioms and expressions (such as 'bread and butter' or 'such a fool'), it still seems to be the better solution. Intra-word dashes and apostrophes were preserved, while double and triple dashes and apostrophes were removed (with regex). Profanity filtering has been based on this dictionary.
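A minimal sketch of this cleaning pipeline with tm, assuming the samples have been combined into a corpus (the stop-word subset and object names are illustrative):
corpus <- VCorpus(VectorSource(c(blogsSample, newsSample, twitterSample)))
corpus <- tm_map(corpus, content_transformer(tolower))
# illustrative subset of coordinating conjunctions and articles
corpus <- tm_map(corpus, removeWords, c("and", "but", "or", "nor", "for", "so", "yet", "a", "an", "the"))
# collapse runs of two or more dashes/apostrophes, leaving single intra-word ones intact
squashRuns <- content_transformer(function(x) gsub("-{2,}|'{2,}", " ", x))
corpus <- tm_map(corpus, squashRuns)
corpus <- tm_map(corpus, removeWords, profanityList)  # profanityList: words read from the dictionary above (loading code omitted)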
The preprocessing included tokenization of sentences and phrases, with dots and commas serving as delimiters. Tokenization changed the number of lines in the samples: to 46,025 for blogs, 42,946 for news and 91,374 for Twitter.
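A sketch of this tokenization step (the helper below is a plausible reconstruction of splitting at dots and commas, not the report's exact code):
splitPhrases <- function(lines) {
  tokens <- unlist(strsplit(lines, "[.,]+"))   # dots and commas as delimiters
  tokens <- trimws(tokens)
  tokens[nchar(tokens) > 0]                    # drop empty fragments
}
blogsTok <- splitPhrases(blogsSample)          # 46,025 phrases for the blog sample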
It is important to evaluate the proportion of sparse terms in the text in order to appropriately 'thin' the word corpus, which is necessary to save memory and computation time. To find the level of sparsity, the samples were converted into a corpus sourced from a single directory (as elements of a single list). The vocabulary sparsity was measured after transformation into a term-document matrix with the tm package.
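A sketch of this step with tm (the binary weighting matches the output below; the 0.5 threshold used for the later 'skimming' is an assumption):
corpus <- VCorpus(DirSource("samples"))        # the three sample files in one directory
tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightBin))
tdm                                            # prints the summary shown below
dim(tdm)
tdmSkimmed <- removeSparseTerms(tdm, 0.5)      # drop terms missing from more than half the documents
dim(tdmSkimmed)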
## <<TermDocumentMatrix (terms: 66142, documents: 3)>>
## Non-/sparse entries: 99893/98533
## Sparsity : 50%
## Maximal term length: 88
## Weighting : binary (bin)
## [1] 66142 3
## [1] 22206 3
Sparsity appeared to be quite low (50%). Further data 'skimming' (removal of the sparsest terms) reduced the vocabulary from 66,142 to 22,206 terms.
Another task of the exploratory analysis for language processing is to find the most frequent n-grams (unigrams, bigrams and trigrams, i.e. single words and two- and three-word clusters) occurring in the corpus. Recognizing these n-grams is the basis for building the future prediction algorithm. This can be done with the RWeka package, performing n-gram tokenization. The results below show the 20 most frequent unigrams, bigrams and trigrams:
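A minimal sketch of the tokenization, shown for bigrams (setting min/max to 1 or 3 gives unigrams and trigrams; the input object is illustrative):
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ngramTok <- bigramTokenizer(paste(blogsTok, collapse = " "))
head(sort(table(ngramTok), decreasing = TRUE), 20)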
## ngramTok
##    to     i    of    in   you    it    is  that     s    on    my  with
## 37023 31218 24603 21198 17246 16227 14998 14740 14682 11738  9520  9228
##    t  was   at   be this have   we  are
## 7902 7879 7827 7740 7436 7405 7037 6723
## ngramTok
##  i m it s don t to be if you that s can t it was
## 3537 3320  2492  2313   1324   1306  1273   1241
## going to i have i was i am will be you re it is i can
##     1237   1236  1222 1166    1116   1113  1095  1075
## to get i love have to want to
##   1011    978     931     921
## ngramTok
## i don t i can t can t wait
##     818     430        348
## going to be i m not i didn t
##         301     301      292
## it s not you don t don t know
##      266       248        233
## i want to i ve been i love you
##       220       218        197
## don t have i have to looking forward to
##        187       179                175
## i m going is going to t wait to
##       171         171       170
## if you re i need to
##       168       164
The sorted n-grams are presented in the form of wordclouds.
The wordcloud for unigrams (all words occurring at least 750 times):
The wordcloud for bigrams (all bigrams occurring at least 500 times):
The wordcloud for trigrams (all trigrams occurring at least 150 times):
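A sketch of how such a cloud can be drawn with the wordcloud package (the 750 threshold matches the unigram cloud above; the colour and ordering settings are assumptions):
freq <- sort(table(ngramTok), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), min.freq = 750,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))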
The lack of intra-word apostrophes (visible in forms like 'don t' and 'i m' above) is obviously the weak point of this analysis, hopefully to be sorted out in the final report.
Recognition of non-English words can be performed with the textcat package, which provides tools for text categorization based on n-grams. Unfortunately, the analysis undertaken with the package turned out to be very slow, and the results were far from perfect for the 1-, 2- and 3-gram analyses, though they clearly improved with the size of the n-grams. Due to the low efficiency of the package, the language analysis has been performed on a 5,000-line sample of the data.
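A minimal sketch of the check, assuming the tokens are held in a character vector (summary() of the logical comparison yields the tables below):
ngramLang <- textcat(ngramTok)      # one language label per n-gram; slow for large inputs
summary(ngramLang == "english")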
The number of n-grams recognized as English (= TRUE):
unigram analysis
##    Mode   FALSE    TRUE    NA's
## logical   28078    6205       0
trigram analysis
##    Mode   FALSE    TRUE    NA's
## logical   13382   11736       0
mixed (unigram, bigram and trigram) analysis
##    Mode   FALSE    TRUE    NA's
## logical   58770   27582       0