This is an exploratory analysis of the SwiftKey HC Corpora dataset, with an eye toward creating a text predictor.
Acquiring the data:
install.packages("R.utils", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'R.utils' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\James\AppData\Local\Temp\RtmpUJlNrn\downloaded_packages
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.2.4
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.2.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.2.3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following object is masked from 'package:quanteda':
##
## trim
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v2.2.0 (2015-12-09) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
#Acquire the files
setwd("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already present; the default
# download method is used because wget is not available on this Windows setup
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip")
}
unzip("Coursera-SwiftKey.zip")
# After unzipping, the files we want are in final/en_US:
# en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt
We’ll create our corpus using tm and then convert it to quanteda for faster processing.
install.packages("tm", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'tm' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
install.packages("SnowballC", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'SnowballC' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
## Loading required package: NLP
encorp <- VCorpus(DirSource(directory="final/en_US", encoding = "UTF-8"))
inspect(encorp)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 206824505
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15639408
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 162096031
We learn that the first entry, en_US.blogs.txt, contains 206,824,505 characters; the second, en_US.news.txt, contains 15,639,408 characters; and the third, en_US.twitter.txt, contains 162,096,031 characters. (The news figure looks low relative to its line count below, so that file may have been read only partially.)
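As a rough sanity check, these character counts can be compared against the file sizes on disk, since for largely ASCII text the byte and character counts should be of the same order. A minimal sketch, assuming the files sit under final/en_US relative to the working directory set above:
# Rough sanity check: raw file sizes in megabytes
files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
round(file.info(files)$size / 1024^2, 1)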
Now let’s take a look at it using quanteda.
install.packages("quanteda", repos = 'http://cran.us.r-project.org')
## Installing package into 'C:/Users/James/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
## package 'quanteda' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\James\AppData\Local\Temp\Rtmp8UycQj\downloaded_packages
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.2.4
##
## Attaching package: 'quanteda'
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
##
## The following object is masked from 'package:NLP':
##
## ngrams
##
## The following object is masked from 'package:stats':
##
## df
##
## The following object is masked from 'package:base':
##
## sample
encorp <- corpus(encorp)
summary(encorp)
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences author datetimestamp
## en_US.blogs.txt 389805 43336963 2081674 <NA> 2016-03-20 20:57:15
## en_US.news.txt 100136 3151102 145003 <NA> 2016-03-20 20:57:15
## en_US.twitter.txt 516143 36985902 2582122 <NA> 2016-03-20 20:57:15
## description heading id language origin
## <NA> <NA> en_US.blogs.txt en <NA>
## <NA> <NA> en_US.news.txt en <NA>
## <NA> <NA> en_US.twitter.txt en <NA>
##
## Source: Converted from tm VCorpus 'encorp'
## Created: Sun Mar 20 16:58:30 2016
## Notes:
This tells us the number of “types” (unique words), “tokens” (here, total words), and sentences for each document.
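One quick use of these figures is the type/token ratio, a crude measure of lexical diversity in each medium. The sketch below simply reuses the values printed in the summary above:
# Type/token ratio per document, from the Types and Tokens columns above
types  <- c(blogs = 389805,   news = 100136,  twitter = 516143)
tokens <- c(blogs = 43336963, news = 3151102, twitter = 36985902)
round(types / tokens, 4)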
Let’s also get some line counts.
#Line counts:
#Blogs
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.blogs.txt")
## [1] 899288
## attr(,"lastLineHasNewline")
## [1] TRUE
#News
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.news.txt")
## [1] 1010242
## attr(,"lastLineHasNewline")
## [1] TRUE
#Twitter
countLines("C:/Users/James/Desktop/Rprogramming/1-CAPSTONE/final/en_US/en_US.twitter.txt")
## [1] 2360148
## attr(,"lastLineHasNewline")
## [1] TRUE
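Combining these line counts with the token counts from the corpus summary gives a rough average entry length for each medium (values copied from the outputs above):
# Average tokens per line, from the summary and countLines results above
tokens <- c(blogs = 43336963, news = 3151102, twitter = 36985902)
lines  <- c(blogs = 899288,   news = 1010242, twitter = 2360148)
round(tokens / lines, 1)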
Now let’s make a document-feature matrix (dfm) to go deeper.
qdfm <- dfm(encorp)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 694,705 feature types
## ... created a 3 x 694705 sparse dfm
## ... complete.
## Elapsed time: 229.13 seconds.
topfeatures(qdfm, 20)
## the to and a i of in you is
## 2945826 1922767 1598287 1574354 1508054 1293467 1023532 853242 813052
## for that it on my with this was be
## 775250 721303 714593 571952 565524 479670 432262 413257 407675
## have at
## 398337 374740
These are all very common words. They will be useful for text prediction later, but let’s also explore the corpus with the most common words (English stopwords) removed and the remaining words stemmed.
nqdfm <- dfm(encorp, ignoredFeatures = stopwords("english"), stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 694,705 feature types
## Warning in rep_len(x, n): Reached total allocation of 12176Mb: see
## help(memory.size)
## Warning in rep_len(x, n): Reached total allocation of 12176Mb: see
## help(memory.size)
## ... removed 174 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 124553 feature variants
## ... created a 3 x 569978 sparse dfm
## ... complete.
## Elapsed time: 220.04 seconds.
topfeatures(nqdfm, 20)
## just get like one will go time can love day
## 255825 245952 245234 227871 220785 215030 197646 193806 190107 183157
## make know good thank now see work new think want
## 158388 157364 156474 150190 146840 134955 130581 129519 128445 126109
Let’s break this down by document:
#Blogs
topfeatures(nqdfm[1], 20)
## one will like time just can get go make day
## 134044 115917 110317 106379 100645 99194 94820 82442 80419 71042
## know year love use thing work peopl want now think
## 69357 66651 64805 64405 62565 62226 61630 61041 60168 59589
#News
topfeatures(nqdfm[2], 20)
## said will year one new state time say get like also can
## 19171 8701 8256 6677 5338 5184 5122 4877 4676 4602 4515 4457
## two go first just last make work peopl
## 4439 4357 4153 4145 4126 3961 3906 3765
#Twitter
topfeatures(nqdfm[3], 20)
## just get thank like go love day good will can
## 151035 146456 131699 130315 128231 124124 108785 102521 96167 90155
## rt one time know now follow u great today see
## 89580 87150 86145 85925 83861 78452 77171 77079 76901 75661
Let’s take a look at how frequencies are distributed across the top 20 words for each document in the corpus, with and without stopwords:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.2.4
##
## Attaching package: 'quanteda'
##
## The following object is masked from 'package:stats':
##
## df
##
## The following object is masked from 'package:base':
##
## sample
par(mfrow = c(2, 3))
hist(topfeatures(qdfm[1], 20), main="Blogs w/ stopwords", xlab="Words")
hist(topfeatures(qdfm[2], 20), main="News w/ stopwords", xlab="Words")
hist(topfeatures(qdfm[3], 20), main="Twitter w/ stopwords", xlab="Words")
hist(topfeatures(nqdfm[1], 20), main="Blogs no stopwords", xlab="Words")
hist(topfeatures(nqdfm[2], 20), main="News no stopwords", xlab="Words")
hist(topfeatures(nqdfm[3], 20), main="Twitter no stopwords", xlab="Words")
It’s interesting that removing stopwords makes little difference to Twitter, but does significantly change the most common words for blogs and news.
There are some useful differences in word usage, like the frequency of “rt” (retweet) on Twitter, that raise the possibility of using different predictors for different media.
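For example, we can ask what share of each document the token “rt” accounts for. A minimal sketch, assuming the dfm can be indexed by feature name like an ordinary sparse matrix:
# Share of each document's non-stopword tokens accounted for by "rt"
nqdfm[, "rt"] / rowSums(nqdfm)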
Our current plan is to build 2- and 3-grams with the ngrams() function and use predict.textmodel for a straightforward prediction step. A possible extension is to let users of the Shiny app choose which type(s) of n-grams to use, and possibly skipgrams as well, so they can see how different approaches affect text prediction.
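As a first pass at that plan, here is a rough sketch of the n-gram step on a toy sentence, assuming this quanteda version’s tokenize(), ngrams(), and skipgrams() signatures (removePunct, n, skip, concatenator); the real tables will of course be built from the corpus itself:
# Toy example of the planned 2-/3-gram (and skipgram) construction
toks <- tokenize(toLower("this is just a quick test of the ngram step"),
                 removePunct = TRUE)
ngrams(toks, n = 2, concatenator = " ")
ngrams(toks, n = 3, concatenator = " ")
skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")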