Introduction

With the proliferation of mobile devices and other hand-held technologies, the need for efficient text processing has increased. Accurate next-word prediction would improve the efficiency and usability of small user interfaces. This project was created as part of the Capstone course in Coursera’s Data Science specialization. The goal of the Capstone is to build a “Shiny” application demonstrating a real-world natural language processing algorithm that predicts, or recommends, the next word given a stream of text.

This document is a milestone report presenting an exploratory analysis of the data made available by SwiftKey.

Exploratory Analysis

System and Session Info

The analysis and algorithm development were performed on OS X Yosemite (version 10.10.2), a 64-bit operating system. Here is the session information from R:

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tm_0.6        NLP_0.1-6     ggplot2_1.0.1 stringi_0.4-1
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-6 digest_0.6.8     evaluate_0.5.5   formatR_1.0     
##  [5] grid_3.1.2       gtable_0.1.2     htmltools_0.2.6  knitr_1.9       
##  [9] MASS_7.3-40      munsell_0.4.2    parallel_3.1.2   plyr_1.8.1      
## [13] proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1   rmarkdown_0.5.1 
## [17] scales_0.2.4     slam_0.1-32      stringr_0.6.2    tools_3.1.2     
## [21] yaml_2.1.13

Data Acquisition

SwiftKey’s raw data consisted of three files compiled from blogs, news articles, and Twitter feeds. The data were downloaded from this location:

Capstone Data

The data set included the following directories: de_DE, en_US, fi_FI, ru_RU. For this phase of the project, I will concentrate on the English files, which include output from each of the three data sources: en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt.

File Name            Size (in MB)   Number of Lines
en_US.blogs.txt             200.4           899,288
en_US.news.txt              196.3         1,010,242
en_US.twitter.txt           159.4         2,360,148
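
The sizes and line counts above can be reproduced with a short R sketch along the following lines (file paths are assumed to point at local copies of the data set; exact line counts may differ slightly depending on how embedded nulls are handled):

```r
# Hypothetical paths; adjust to the local copy of the data set.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

file_stats <- data.frame(
  file  = files,
  mb    = round(file.info(files)$size / 1024^2, 1),                    # size in megabytes
  lines = sapply(files, function(f) length(readLines(f, skipNul = TRUE)))
)
file_stats
```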

The Twitter source contained several unprintable characters that needed to be removed before loading. Each “line” contains the complete text of a blog entry, news article, or tweet.
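
A minimal sketch of one way to read the Twitter file while dropping those characters (the exact cleaning steps used for this report are an assumption):

```r
# Read the Twitter file, skipping embedded nulls.
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Drop unconvertible bytes, then replace remaining control characters with spaces.
twitter <- iconv(twitter, from = "UTF-8", to = "UTF-8", sub = "")
twitter <- gsub("[[:cntrl:]]", " ", twitter)
```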

Textual Analysis

In raw form, the Twitter source has a much larger variety of words than either the blogs or the news.

Here are summary statistics for the number of words per line in each of the English sources:

## [1] "Blog Words per Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
## [1] "News Words per Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
## [1] "Twitter Words Line: "
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.79   18.00   61.00
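
Summaries like the ones above might be produced with stringi’s word counting, assuming the three sources have already been loaded into character vectors named blogs, news, and twitter (hypothetical object names):

```r
library(stringi)

# Count words in each line and summarise the distribution per source.
print("Blog Words per Line: ");    print(summary(stri_count_words(blogs)))
print("News Words per Line: ");    print(summary(stri_count_words(news)))
print("Twitter Words per Line: "); print(summary(stri_count_words(twitter)))
```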

It’s clear that the blogs have more words per entry, while the news and Twitter sources have much more variety. One cause of the variety of words in the Twitter feed is the more common use of abbreviations and “twitter-speak”.

Sampled Analysis

As the overall size of each source prevents complete word analysis with the equipment available to me, I sampled each source to gather a representative sample of 1,000 lines. Profanity, racial epithets, and other undesirable words were then removed from the sampled lines using a black list obtained from Shutterstock (a sketch of this sampling and filtering step appears after the link below). The URL can be found here:

Black List
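
A sketch of the sampling and black-list loading step, assuming the object names from the loading step above and a hypothetical local file name for the list:

```r
set.seed(1234)  # hypothetical seed, chosen only for reproducibility

# Draw 1,000 lines from each source (object names `blogs`, `news`, `twitter`
# are assumptions carried over from the loading step above).
sample_lines <- c(sample(blogs,   1000),
                  sample(news,    1000),
                  sample(twitter, 1000))

# The profanity black list, saved locally under a hypothetical file name.
bad_words <- readLines("shutterstock-bad-words.txt", skipNul = TRUE)
```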

After removing objectionable terms, whitespace, and numbers, this left 3,000 combined lines of text from the English sources for the creation of a corpus of phrases and word groups. From this, a term-document matrix was created, defining 13,315 unique words.
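
A sketch of how the corpus and term-document matrix might be built with the tm package; lowercasing and punctuation removal are assumptions beyond the steps listed above:

```r
library(tm)

# Build a corpus from the 3,000 sampled lines and apply the cleaning steps.
corpus <- Corpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, bad_words)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)
tdm   # reports the number of unique terms found in the sample
```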

The 30 most common words in the sample are: about, all, and, are, but, can, for, from, had, has, have, his, its, just, more, not, one, out, said, some, that, the, they, this, was, what, will, with, you, your. These are the words one would expect to find in standard English text.
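
One way to read the most frequent terms off the term-document matrix (a sketch; the original analysis may have used tm’s findFreqTerms instead):

```r
# Sum term frequencies across documents and take the 30 most common words.
term_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(names(term_freq), 30)
```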

Conclusion

I have been fairly surprised by the complexity and challenge involved so far in preparing the data and model for the prediction algorithm. There are many complications that I did not initially anticipate.

My goal is to build an n-gram factor model. Each n-gram factor will then be used by a Naive Bayes learning algorithm to make a next-word prediction.
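
As a first step toward that model, bigram extraction in base R might look like this; the function below is only a sketch, not the final implementation:

```r
# Split a line into words and pair each word with its successor.
bigrams <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams("the quick brown fox")
## [1] "the quick"   "quick brown" "brown fox"
```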