Introduction

With the proliferation of mobile devices and other hand-held technologies, the need for efficient text processing has increased. Accurate next-word prediction would improve the efficiency and usability of small user interfaces. This project was created as part of the Capstone course in Coursera’s Data Science specialization. The goal of the Capstone is to build a “Shiny” application demonstrating a real-world natural language processing algorithm that predicts, or recommends, the next word given a stream of text.

This document is a milestone report presenting an exploratory analysis of the data made available by SwiftKey.

Exploratory Analysis

System and Session Info

The analysis and algorithm development were performed on OS X Yosemite (version 10.10.2), a 64-bit operating system. Here is the session information from R:

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tm_0.6        NLP_0.1-6     ggplot2_1.0.1 stringi_0.4-1
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-6 digest_0.6.8     evaluate_0.5.5   formatR_1.0     
##  [5] grid_3.1.2       gtable_0.1.2     htmltools_0.2.6  knitr_1.9       
##  [9] MASS_7.3-40      munsell_0.4.2    parallel_3.1.2   plyr_1.8.1      
## [13] proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1   rmarkdown_0.5.1 
## [17] scales_0.2.4     slam_0.1-32      stringr_0.6.2    tools_3.1.2     
## [21] yaml_2.1.13

Data Acquisition

SwiftKey’s raw data consisted of three files compiled from blogs, news articles, and Twitter feeds. The data were downloaded from this location:

Capstone Data

The data set included the following directories: de_DE, en_US, fi_FI, ru_RU. For this phase of the project, I will concentrate on the English files, which include output from each of the three data sources: en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt.

File Name            Size (in MB)   Number of Lines
en_US.blogs.txt             200.4           899,288
en_US.news.txt              196.3         1,010,242
en_US.twitter.txt           159.4         2,360,148
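
The sizes and line counts above can be reproduced with a short R sketch along the following lines (file paths are assumed to point at local copies of the data set; exact line counts may differ slightly depending on how embedded nulls are handled):

```r
# Hypothetical paths; adjust to the local copy of the data set.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

file_stats <- data.frame(
  file  = files,
  mb    = round(file.info(files)$size / 1024^2, 1),                    # size in megabytes
  lines = sapply(files, function(f) length(readLines(f, skipNul = TRUE)))
)
file_stats
```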

The Twitter source contained several unprintable characters that needed to be removed before loading. Each “line” contains the complete text of a blog entry, news article, or tweet.
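
A minimal sketch of one way to read the Twitter file while dropping those characters (the exact cleaning steps used for this report are an assumption):

```r
# Read the Twitter file, skipping embedded nulls.
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Drop unconvertible bytes, then replace remaining control characters with spaces.
twitter <- iconv(twitter, from = "UTF-8", to = "UTF-8", sub = "")
twitter <- gsub("[[:cntrl:]]", " ", twitter)
```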

Textual Analysis

In raw form, the Twitter source has a much larger variety of words than either the blogs or the news.

Here are summary statistics for the number of words per line in each of the English sources:

## [1] "Blog Words per Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
## [1] "News Words per Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
## [1] "Twitter Words Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.79   18.00   61.00
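
Summaries like the ones above might be produced with stringi’s word counting, assuming the three sources have already been loaded into character vectors named blogs, news, and twitter (hypothetical object names):

```r
library(stringi)

# Count words in each line and summarise the distribution per source.
print("Blog Words per Line: ");    print(summary(stri_count_words(blogs)))
print("News Words per Line: ");    print(summary(stri_count_words(news)))
print("Twitter Words per Line: "); print(summary(stri_count_words(twitter)))
```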

It’s clear that the blogs have more words per entry, while the news and Twitter sources have much more variety. One cause of the variety of words in the Twitter feed is the more common use of abbreviations and “twitter-speak”.

Sampled Analysis

As the overall size of each source prevents complete word analysis with the equipment available to me, I sampled each source to gather a representative sample of 1,000 lines. Profanity, racial epithets, and other undesirable words were then removed from the sampled lines using a black list obtained from Shutterstock (a sketch of this sampling and filtering step appears after the link below). The URL can be found here:

Black List
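
A sketch of the sampling and black-list loading step, assuming the object names from the loading step above and a hypothetical local file name for the list:

```r
set.seed(1234)  # hypothetical seed, chosen only for reproducibility

# Draw 1,000 lines from each source (object names `blogs`, `news`, `twitter`
# are assumptions carried over from the loading step above).
sample_lines <- c(sample(blogs,   1000),
                  sample(news,    1000),
                  sample(twitter, 1000))

# The profanity black list, saved locally under a hypothetical file name.
bad_words <- readLines("shutterstock-bad-words.txt", skipNul = TRUE)
```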

After removing objectionable terms, whitespace, and numbers, this left 3,000 combined lines of text from the English sources for the creation of a corpus of phrases and word groups. From this, a term-document matrix was created, defining 13,315 unique words.
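
A sketch of how the corpus and term-document matrix might be built with the tm package; lowercasing and punctuation removal are assumptions beyond the steps listed above:

```r
library(tm)

# Build a corpus from the 3,000 sampled lines and apply the cleaning steps.
corpus <- Corpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, bad_words)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)
tdm   # reports the number of unique terms found in the sample
```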

The 30 most common words in the sample are: about, all, and, are, but, can, for, from, had, has, have, his, its, just, more, not, one, out, said, some, that, the, they, this, was, what, will, with, you, your. These are the words one would expect to find in standard English text.
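
One way to read the most frequent terms off the term-document matrix (a sketch; the original analysis may have used tm’s findFreqTerms instead):

```r
# Sum term frequencies across documents and take the 30 most common words.
term_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(names(term_freq), 30)
```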

Conclusion

I have been fairly surprised by the complexity and challenge involved so far in preparing the data and model for the prediction algorithm. There are many complications that I did not initially anticipate.

My goal is to build an n-gram factor model. Each n-gram factor will then be used by a Naive Bayes learning algorithm to make a next-word prediction.
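
As a first step toward that model, bigram extraction in base R might look like this; the function below is only a sketch, not the final implementation:

```r
# Split a line into words and pair each word with its successor.
bigrams <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams("the quick brown fox")
## [1] "the quick"   "quick brown" "brown fox"
```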