Introduction

This is an exploratory analysis of the SwiftKey HC Corpora dataset, with an eye toward building a text predictor.

Acquiring the data:

install.packages("R.utils", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'R.utils' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\James\AppData\Local\Temp\RtmpUJlNrn\downloaded_packages
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.2.4
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.2.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.2.3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following object is masked from 'package:quanteda':
## 
##     trim
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v2.2.0 (2015-12-09) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
#Acquire the files
setwd("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#Note: method = "wget" fails on Windows unless wget is on the PATH (exit
#status 127), so we use the default method; mode = "wb" keeps the zip binary
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip")
#After unzipping, the files we want are in final/en_US:
#  en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt
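
A quick sanity check (a minimal sketch; adjust the path if you extracted elsewhere) confirms the three English files are present:

list.files("final/en_US")
## expected: "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"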

We’ll create our corpus using tm and then convert it to quanteda for faster processing.

Summary statistics and data tables

install.packages("tm", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'tm' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
install.packages("SnowballC", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'SnowballC' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
## Loading required package: NLP
encorp <- VCorpus(DirSource(directory="final/en_US", encoding = "UTF-8"))

inspect(encorp)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 206824505
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15639408
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 162096031

We learn that the first entry (en_US.blogs.txt) contains 206,824,505 characters; the second (en_US.news.txt), 15,639,408; and the third (en_US.twitter.txt), 162,096,031.
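
As a cross-check, we can compare against the file sizes on disk (a quick sketch; file.size() reports bytes, which need not equal character counts for multi-byte UTF-8 text):

#Bytes on disk, in the same order as the corpus entries above
file.size(file.path("final/en_US",
                    c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")))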

Now let’s take a look at it using quanteda.

install.packages("quanteda", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'quanteda' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.2.4
## 
## Attaching package: 'quanteda'
## 
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## 
## The following object is masked from 'package:NLP':
## 
##     ngrams
## 
## The following object is masked from 'package:stats':
## 
##     df
## 
## The following object is masked from 'package:base':
## 
##     sample
#Convert the tm VCorpus directly to a quanteda corpus
encorp <- corpus(encorp)
summary(encorp)
## Corpus consisting of 3 documents.
## 
##               Text  Types   Tokens Sentences author       datetimestamp
##    en_US.blogs.txt 389805 43336963   2081674   <NA> 2016-03-20 20:57:15
##     en_US.news.txt 100136  3151102    145003   <NA> 2016-03-20 20:57:15
##  en_US.twitter.txt 516143 36985902   2582122   <NA> 2016-03-20 20:57:15
##  description heading                id language origin
##         <NA>    <NA>   en_US.blogs.txt       en   <NA>
##         <NA>    <NA>    en_US.news.txt       en   <NA>
##         <NA>    <NA> en_US.twitter.txt       en   <NA>
## 
## Source:  Converted from tm VCorpus 'encorp'
## Created: Sun Mar 20 16:58:30 2016
## Notes:

This tells us the number of “types” (unique words), “tokens” (here, total words), and sentences for each document.
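
One quick derived statistic (pure arithmetic on the figures above) is the type/token ratio, a rough measure of vocabulary richness:

#Type/token ratio per document, from the summary(encorp) output
types  <- c(blogs = 389805, news = 100136, twitter = 516143)
tokens <- c(blogs = 43336963, news = 3151102, twitter = 36985902)
round(types / tokens, 4)
##   blogs    news twitter 
##  0.0090  0.0318  0.0140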

Let’s also get some line counts.

#Line counts:

#Blogs
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.blogs.txt")
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
#News
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.news.txt")
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
#Twitter
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.twitter.txt")
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE
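
Combining these line counts with the token counts from summary(encorp) gives average tokens per line (again, pure arithmetic on the figures above):

#Average tokens per line for each document
tokens <- c(blogs = 43336963, news = 3151102, twitter = 36985902)
lines  <- c(blogs = 899288, news = 1010242, twitter = 2360148)
round(tokens / lines, 1)
##   blogs    news twitter 
##    48.2     3.1    15.7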

Document-feature matrix

Now let’s make a document-feature matrix to go deeper.

qdfm <- dfm(encorp)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 694,705 feature types
##    ... created a 3 x 694705 sparse dfm
##    ... complete. 
## Elapsed time: 229.13 seconds.
topfeatures(qdfm, 20)
##     the      to     and       a       i      of      in     you      is 
## 2945826 1922767 1598287 1574354 1508054 1293467 1023532  853242  813052 
##     for    that      it      on      my    with    this     was      be 
##  775250  721303  714593  571952  565524  479670  432262  413257  407675 
##    have      at 
##  398337  374740

All are very common words. These frequencies will be useful for text prediction later, but let’s also explore the corpus without the most common words.

nqdfm <- dfm(encorp, ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 694,705 feature types
## Warning in rep_len(x, n): Reached total allocation of 12176Mb: see
## help(memory.size)
## Warning in rep_len(x, n): Reached total allocation of 12176Mb: see
## help(memory.size)
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... stemming features (English), trimmed 124553 feature variants
##    ... created a 3 x 569978 sparse dfm
##    ... complete. 
## Elapsed time: 220.04 seconds.
topfeatures(nqdfm, 20)
##   just    get   like    one   will     go   time    can   love    day 
## 255825 245952 245234 227871 220785 215030 197646 193806 190107 183157 
##   make   know   good  thank    now    see   work    new  think   want 
## 158388 157364 156474 150190 146840 134955 130581 129519 128445 126109

Let’s break this down by document:

#Blogs
topfeatures(nqdfm[1], 20)
##    one   will   like   time   just    can    get     go   make    day 
## 134044 115917 110317 106379 100645  99194  94820  82442  80419  71042 
##   know   year   love    use  thing   work  peopl   want    now  think 
##  69357  66651  64805  64405  62565  62226  61630  61041  60168  59589
#News
topfeatures(nqdfm[2], 20)
##  said  will  year   one   new state  time   say   get  like  also   can 
## 19171  8701  8256  6677  5338  5184  5122  4877  4676  4602  4515  4457 
##   two    go first  just  last  make  work peopl 
##  4439  4357  4153  4145  4126  3961  3906  3765
#Twitter
topfeatures(nqdfm[3], 20)
##   just    get  thank   like     go   love    day   good   will    can 
## 151035 146456 131699 130315 128231 124124 108785 102521  96167  90155 
##     rt    one   time   know    now follow      u  great  today    see 
##  89580  87150  86145  85925  83861  78452  77171  77079  76901  75661

Basic plots

Let’s take a look at the distribution of words for each document in the corpus, with and without the most common words:

par(mfrow = c(2, 3))

hist(topfeatures(qdfm[1], 20), main="Blogs w/ stopwords", xlab="Words")
hist(topfeatures(qdfm[2], 20), main="News w/ stopwords", xlab="Words")
hist(topfeatures(qdfm[3], 20), main="Twitter w/ stopwords", xlab="Words")

hist(topfeatures(nqdfm[1], 20), main="Blogs no stopwords", xlab="Words")
hist(topfeatures(nqdfm[2], 20), main="News no stopwords", xlab="Words")
hist(topfeatures(nqdfm[3], 20), main="Twitter no stopwords", xlab="Words")

Interesting findings

It’s interesting that removing stopwords makes little difference to Twitter, but does significantly change the most common words for blogs and news.

There are some useful differences in word usage, like the frequency of “rt” on Twitter, that raise the possibility of using different predictors for different media.
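
To quantify such differences, one option (a minimal sketch; dfm objects support standard sparse-matrix indexing, so this should work, but treat it as illustrative) is to compare a term’s relative frequency across the three documents:

#Share of each document's (stemmed, stopword-free) tokens taken up by a term
relFreq <- function(x, term) as.numeric(x[, term]) / rowSums(x)
relFreq(nqdfm, "rt")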

Plans for prediction algorithm

Our current plan is to build 2- and 3-grams using the ngrams() function and use predict.textmodel for straightforward prediction. A possible modification is to let users of the Shiny app select which type(s) of n-grams to use, and possibly skip-grams as well, so they can see how different approaches affect text prediction.
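
As a rough illustration of the lookup idea, here is a minimal sketch on toy data. The tokenize()/ngrams() calls follow the quanteda 0.9.x API used in this report (newer versions use tokens()/tokens_ngrams()), and predictNext is a hypothetical helper, not a quanteda function:

library(quanteda)

#Toy training text; the real tables would be built from the full corpus
traintxt <- c("we are going to the store",
              "we are going to the park",
              "we are going home")
toks <- tokenize(toLower(traintxt), removePunct = TRUE)

#Frequency tables of 2- and 3-grams, joined with spaces for easy splitting
bi  <- sort(table(unlist(ngrams(toks, n = 2, concatenator = " "))), decreasing = TRUE)
tri <- sort(table(unlist(ngrams(toks, n = 3, concatenator = " "))), decreasing = TRUE)

#Hypothetical helper: most frequent continuation of the last two words,
#trying the trigram table first and backing off to bigrams
predictNext <- function(phrase) {
  w <- strsplit(tolower(phrase), "\\s+")[[1]]
  if (length(w) >= 2) {
    hits <- tri[grepl(paste0("^", w[length(w) - 1], " ", w[length(w)], " "), names(tri))]
    if (length(hits) > 0) return(tail(strsplit(names(hits)[1], " ")[[1]], 1))
  }
  hits <- bi[grepl(paste0("^", w[length(w)], " "), names(bi))]
  if (length(hits) > 0) return(tail(strsplit(names(hits)[1], " ")[[1]], 1))
  NA_character_
}

predictNext("we are going")
## expected: "to"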