Data Science Capstone: Milestone 1 Report

Overview

The Data Science Capstone project involves developing a model to predict the next word given a sequence of text.

The first stage of the project is exploratory in nature. A corpus of text data provided by Swiftkey is loaded into R and some basic exploratory analysis is performed. Once the basic corpus is explored, the data is converted into data frequency matrices (quanteda document-feature matrices) for words, bigrams (pairs of consecutive words), and trigrams (triplets of consecutive words). These will provide the basis for a preliminary ngram prediction model.

All code used to generate the tables and graphs in this report is included in the appendix.

Libraries

The initial exploratory analysis and tokenization make use of the tm and quanteda R packages. The data.table package is also loaded to hold some of the resulting frequency tables.

Swiftkey Data

The Swiftkey data can be downloaded from the capstone website. The data includes large text datasets in several languages. For the purposes of this project, only the English data is used.

Full Corpus

First, the entire Swiftkey English corpus is loaded. A basic summary is performed to understand the scope of the data.

## Corpus consisting of 3 documents.
## 
##   Text  Types   Tokens Sentences author       datetimestamp description
##  text1 389796 43336963   2076869   <NA> 2017-02-15 17:35:14        <NA>
##  text2 328192 40948593   1868560   <NA> 2017-02-15 17:35:14        <NA>
##  text3 516136 36985902   2578053   <NA> 2017-02-15 17:35:14        <NA>
##  heading                id language origin
##     <NA>   en_US.blogs.txt    en_US   <NA>
##     <NA>    en_US.news.txt    en_US   <NA>
##     <NA> en_US.twitter.txt    en_US   <NA>
## 
## Source:  Converted from tm VCorpus 'en'
## Created: Wed Feb 15 09:36:33 2017
## Notes:

The Swiftkey data is large (roughly 120M tokens and 6.5M sentences), large enough that manipulating it strains computers with limited RAM.

Sample Data and Corpus

Instead of using the full dataset, random samples are generated for each of the three data files, resulting in files that are approximately 10% of the original size. A corpus is created from the sampled data files. A basic summary follows.
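The sampling step can be sketched in base R as follows. The appendix calls a helper named sampleFiles(rcon, wcon, 0.1) whose definition is not shown in this report, so the body below is an assumption about how such a helper might work, not the author's code: it keeps each line with the given probability and reads in chunks so the full file never has to be held in memory. The chunk_size parameter and the temporary files are hypothetical.

```r
# Hypothetical helper: copy each line of an open read connection to an open
# write connection with probability `fraction`. The name and argument order
# mirror the sampleFiles(rcon, wcon, 0.1) calls in the appendix, but the body
# is an assumption, not the code used to produce this report.
sampleFiles <- function(rcon, wcon, fraction, chunk_size = 10000L) {
  # Close both connections when the function exits, even on error
  on.exit({ close(rcon); close(wcon) })
  kept <- 0L
  repeat {
    lines <- readLines(rcon, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0L) break
    # Independent coin flip per line: keep with probability `fraction`
    keep <- rbinom(length(lines), size = 1L, prob = fraction) == 1L
    writeLines(lines[keep], wcon)
    kept <- kept + sum(keep)
  }
  invisible(kept)
}

# Usage on a small temporary file
set.seed(42)
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), src)
n_kept <- sampleFiles(file(src, "rt"), file(dst, "wt"), 0.1)
sampled <- readLines(dst)
```

Because each line is kept independently, the sample size is only approximately 10% of the input, which is consistent with the "approximately 10%" figure above. Calling set.seed() before sampling makes the sample reproducible.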

## Corpus consisting of 3 documents.
## 
##   Text  Types  Tokens Sentences author       datetimestamp description
##  text1 118995 4344996    207873   <NA> 2017-02-15 17:45:58        <NA>
##  text2 112856 4094457    186472   <NA> 2017-02-15 17:45:58        <NA>
##  text3 132791 3712354    259333   <NA> 2017-02-15 17:45:58        <NA>
##  heading                 id language origin
##     <NA>   sample.blogs.txt       en   <NA>
##     <NA>    sample.news.txt       en   <NA>
##     <NA> sample.twitter.txt       en   <NA>
## 
## Source:  Converted from tm VCorpus 'en'
## Created: Wed Feb 15 09:46:01 2017
## Notes:

The sampled corpus contains about 12M total tokens, compared to the full corpus size of 120M tokens.

(Note that at this stage of the project, the data has not yet been split into training, validation and test sets, as no model building/evaluation has taken place. Reserving test/validation data will occur in the next phase of the project.)

Data Frequency and Exploratory Analysis

A data frequency matrix (a quanteda document-feature matrix, or dfm) is created from the corpus. Profanity, punctuation, symbols, numbers, and hashtags are removed from the corpus.

Stopwords (short, common words such as “the”) are also removed from the data. The motivation is that predicting short, common words that are easy to type offers little benefit to the user. Because stopwords are so common and the model will never predict one, the overall accuracy of the model may be significantly worse, depending on how accuracy is measured. The hope is that this is offset by a higher utility to the user. The decision to include or exclude stopwords can be revisited once a preliminary model has been evaluated.
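The cleaning steps described above can be illustrated with a small base R sketch. This is not the quanteda pipeline used in the report (see the appendix for that). The stop_list vector is a toy stand-in for stopwords("english") plus the profanity list, and the regular expressions are a simplification that handles ASCII text only.

```r
# Toy stopword list; an assumption standing in for stopwords("english")
# combined with the profanity list used in the report.
stop_list <- c("the", "a", "to", "is", "of", "and", "i")

clean_tokens <- function(text, drop = stop_list) {
  text <- tolower(text)
  text <- gsub("#\\S+", " ", text)      # drop hashtags
  text <- gsub("[0-9]+", " ", text)     # drop numbers
  text <- gsub("[^a-z' ]", " ", text)   # drop punctuation and symbols (ASCII only)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[nzchar(tokens)]      # discard empty strings left by splitting
  tokens[!tokens %in% drop]             # remove stopwords / blocked terms
}

clean_tokens("I went to the store, and bought 3 apples! #shopping")
```

In this toy example only "went", "store", "bought" and "apples" survive, which shows how much of ordinary text consists of stopwords.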

The dataset can be further reduced by dropping low-frequency words. Removing words that occur fewer than 10 times drops roughly 85% of the unique words but still retains approximately 94% of all word instances.

## [1] "Total words:  209702"

## [1] "Words retained:  30942"

## [1] "Ratio of unique words retained:  0.148"

## [1] "Ratio of total word instances retained:  0.938"

##  will  just  said   one  like   can   get  time   new  good 
## 31589 30706 30477 29039 27312 24837 22818 21428 19510 18077

After removing stopwords, profanity, and words with counts below 10, a histogram of word frequencies is generated. There are over 3000 words that occur 10 times, about 500 words that occur 20 times, nearly 300 that occur 30 times, and approximately 200 that occur 40 times. The counts continue to drop off and become sparse toward the high-frequency end of the distribution. The most frequently occurring words are summarized using the topfeatures() function from the quanteda library.
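The two retention ratios reported in this section measure different things: the share of distinct words kept and the share of word occurrences kept. A minimal base R sketch with made-up counts (not taken from the corpus) shows why trimming can discard most of the vocabulary while keeping most of the text.

```r
# Toy word counts (made-up numbers, not taken from the corpus)
counts <- c(will = 120, just = 95, said = 40, zebra = 9, quux = 3, blorf = 1)

min_count <- 10
trimmed <- counts[counts >= min_count]

unique_ratio   <- length(trimmed) / length(counts)  # share of distinct words kept
instance_ratio <- sum(trimmed) / sum(counts)        # share of word occurrences kept

round(unique_ratio, 3)    # 0.5
round(instance_ratio, 3)  # 0.951
```

Half of the toy vocabulary is dropped, yet about 95% of the occurrences remain, because the dropped words are rare by construction.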

Bigrams and Trigrams

A similar procedure can be followed to generate lists of bigrams and trigrams. Here the tokenization must be performed explicitly, with stopwords and profanity removed, before generating the data frequency matrices for bigrams and trigrams (see code for details). Singletons, that is, bigrams and trigrams that occur only once, are removed. Because the bigrams and trigrams are much more diverse than the single words, the instance retention rates are only 46% and 5.6% respectively.

## [1] "Total Bigrams:  3543352"

## [1] "Bigrams retained:  561642"

## [1] "Ratio of unique bigrams retained:  0.159"

## [1] "Ratio of total bigram instances retained:  0.459"

##       right_now        new_york       last_year      last_night 
##            2596            1981            1838            1655 
##     high_school       feel_like       years_ago       last_week 
##            1456            1346            1325            1288 
##      first_time looking_forward 
##            1229            1180

## [1] "Total Trigrams:  5310281"

## [1] "Trigrams retained:  112222"

## [1] "Ratio of unique trigrams retained:  0.021"

## [1] "Ratio of total trigrams retained:  0.056"

##          new_york_city            let_us_know         happy_new_year 
##                    266                    255                    200 
##      happy_mothers_day     happy_mother's_day president_barack_obama 
##                    179                    166                    158 
##          cinco_de_mayo         new_york_times          two_years_ago 
##                    127                    126                    121 
## looking_forward_seeing 
##                    115

The bigram and trigram histograms show that the vast majority of the ngrams occur only a handful of times. There are about 400 bigrams that occur 30 times, and about 100 that occur 50 times. The most frequent bigram (“right now”) occurs 2596 times. There are fewer than 400 trigrams that occur 10 times, and fewer than 100 that occur 20 times. The most frequent trigram (“new york city”) occurs only 266 times.
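For readers who want to see what ngram generation amounts to, here is a minimal base R sketch. The report itself uses quanteda's tokens_ngrams(); the make_ngrams() function and the toy token vector below are hypothetical and only illustrate the idea of joining consecutive tokens with "_" and then dropping singletons.

```r
# Hypothetical helper: build n-grams by pasting together n shifted copies
# of the token vector, joined with "_" as in the output shown above.
make_ngrams <- function(tokens, n, sep = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1L)
  # Element k of `parts` holds the k-th word of every n-gram
  parts <- lapply(seq_len(n), function(k) tokens[starts + k - 1L])
  do.call(paste, c(parts, sep = sep))
}

tokens <- c("new", "york", "city", "new", "york", "times")
bigrams <- make_ngrams(tokens, 2)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Drop singletons, mirroring the min_count = 2 trim used in the appendix
bigram_counts[bigram_counts >= 2]
```

Note that when stopwords are removed before the ngrams are formed, an ngram can span a removed word. That is presumably why a trigram such as looking_forward_seeing appears in the list above.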

Next Steps

The next step in developing an ngram based prediction model is to smooth the ngram probabilities, accounting for unseen ngrams. This smoothing can take several forms, from simple models (adding 1 to the count of each ngram) to more complex models, such as Good-Turing discounting or Kneser-Ney smoothing. The course instructors have also suggested investigating a Katz back-off model. One or more of these will be implemented in the next phase of the project. Once the probabilities of the bigrams and trigrams are determined, these can be used to predict the next most likely words in a sequence. For example, given two words, determine the most probable trigrams beginning with those words; if no matching trigrams exist, back off to bigrams, and so on.
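As a rough illustration of the two simplest ideas mentioned above, the following base R sketch shows an add-one (Laplace) estimate of a bigram probability and a count-based back-off lookup. All counts are made up, the function names are hypothetical, and the back-off here simply falls through to shorter ngrams without any discounting, so it is not a Katz back-off model.

```r
# Toy counts (made-up), keyed with "_" like the ngrams in this report
unigram_counts <- c(new = 5, york = 4, city = 2, times = 2, happy = 3, year = 2)
bigram_counts  <- c(new_york = 4, york_city = 2, york_times = 2,
                    happy_new = 2, new_year = 1)
trigram_counts <- c(new_york_city = 2, new_york_times = 1, happy_new_year = 2)

# Add-one (Laplace) estimate of P(w2 | w1): (c(w1 w2) + 1) / (c(w1) + V),
# where V is the vocabulary size
add_one_prob <- function(w1, w2) {
  V <- length(unigram_counts)
  key <- paste(w1, w2, sep = "_")
  c12 <- if (key %in% names(bigram_counts)) bigram_counts[[key]] else 0
  c1  <- if (w1 %in% names(unigram_counts)) unigram_counts[[w1]] else 0
  (c12 + 1) / (c1 + V)
}

# Last word of the most frequent ngram in `counts` that starts with `prefix`
best_continuation <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, "_"))]
  if (length(hits) == 0L) return(NA_character_)
  sub(".*_", "", names(hits)[which.max(hits)])
}

# Back off from trigrams to bigrams to the most common single word
predict_next <- function(w1, w2) {
  guess <- best_continuation(paste(w1, w2, sep = "_"), trigram_counts)
  if (is.na(guess)) guess <- best_continuation(w2, bigram_counts)
  if (is.na(guess)) guess <- names(unigram_counts)[which.max(unigram_counts)]
  guess
}

predict_next("new", "york")   # trigram match
predict_next("big", "york")   # no trigram; backs off to bigrams starting with "york"
add_one_prob("new", "york")   # seen bigram
add_one_prob("new", "city")   # unseen bigram still gets a nonzero probability
```

Add-one smoothing is easy to implement but is known to shift too much probability mass to unseen ngrams when the vocabulary is large, which is one reason the discounting methods named above are worth investigating.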

Appendix: Code

Following are code segments used in the production of this report.

Libraries

library(tm)
library(quanteda)
library(data.table)

Swiftkey Data

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Download the zip file if it does not exist
if (!file.exists("./swiftkey.zip")) {
    download.file(fileUrl, "./swiftkey.zip", method="curl")
}

# Data is unpacked to a directory called "final" -- if it does not exist, then unzip
if (!file.exists("final")) {
  unzip("./swiftkey.zip")
}

Full Corpus

# English directory
en_dir <- "./final/en_US"

# Load corpus
en <- VCorpus(DirSource(en_dir, encoding="UTF-8"),readerControl = list(language="en_US"))

#convert VCorpus to a quanteda corpus and summarize
enc <- corpus(en)
rm(en)
summary(enc)

Sample Data and Corpus

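# sampleFiles() is a helper whose definition is not shown in this report;
# it is called here to write a random sample of roughly 10% of each source file.
# Create the output directory first so the write connections can be opened.
if (!dir.exists("./samples")) dir.create("./samples")

if (!file.exists("./samples/sample.blogs.txt")) {
  rcon <- file("./final/en_US/en_US.blogs.txt", "rt")
  wcon <- file("./samples/sample.blogs.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}
if (!file.exists("./samples/sample.news.txt")) {
  rcon <- file("./final/en_US/en_US.news.txt", "rt")
  wcon <- file("./samples/sample.news.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}
if (!file.exists("./samples/sample.twitter.txt")) {
  rcon <- file("./final/en_US/en_US.twitter.txt", "rt")
  wcon <- file("./samples/sample.twitter.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}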

# Sample directory
sample_dir <- "./samples"
en <- VCorpus(DirSource(sample_dir, encoding="UTF-8"),readerControl = list(language="en"))

#convert VCorpus to a quanteda corpus and summarize
enc <- corpus(en)
rm(en)
summary(enc)

Data Frequency and Exploratory Analysis

#Get list of profanity words, remove trailing commas and get rid of header info
prof <- read.csv("./Terms-to-Block.csv")
prof <- apply(prof, 2, function(e) gsub(',','',e))
prof <- prof[4:726,2]

# Use dfm() to get word frequencies, removing stopwords, profanity, punctuation, symbols, numbers, hashtags
wordDfm <- dfm(enc, ngram=1, remove = c(stopwords("english"),prof), removePunct=TRUE, removeSymbols=TRUE, removeNumbers=TRUE, removeTwitter=TRUE)
wordDfm <- dfm_sort(wordDfm)
worddt <- as.data.table(wordDfm)

# Counts of individual tokens and total word instances
counts <- apply(worddt,2,function(e) sum(e))
instances <- sum(counts)

# Trim low frequency words
wordTrim <- dfm_trim(wordDfm,min_count=10)
wordTrimdt <- as.data.table(wordTrim)
trimcounts <- apply(wordTrimdt,2,function(e) sum(e))
triminstances <- sum(trimcounts)
retained <- triminstances/instances
paste("Total words: ", length(counts))
paste("Words retained: ", length(trimcounts))
paste("Ratio of unique words retained: ", round(length(trimcounts)/length(counts),3))
paste("Ratio of total word instances retained: ", round(retained,3))

# create histogram of word counts and list top word features
hist(trimcounts,breaks=25000,xlim=c(10,50),ylim=c(0,3500),xlab="Word Counts",main="Histogram of Word Counts")
topfeatures(wordTrim)

Bigrams and Trigrams

#Need to tokenize and remove stopwords and profanity before dfm
tkns <- tokenize(tolower(enc), removePunct=TRUE, removeSymbols=TRUE, removeNumbers=TRUE, removeTwitter=TRUE)
tkns <- removeFeatures(tkns, c(stopwords("english"),prof))
tkns_bigrams <- tokens_ngrams(tkns, 2)
 
myDfm2 <- dfm(tkns_bigrams)
myDfm2 <- dfm_sort(myDfm2)
rm(tkns_bigrams)
bidf <- as.data.frame(as.matrix(myDfm2))

# Counts of bigram tokens and total bigram instances
bicounts <- apply(bidf,2,function(e) sum(e))
biinstances <- sum(bicounts)

# Trim bigrams with count < 2
biTrim <- dfm_trim(myDfm2,min_count=2)
biTrimdt <- as.data.table(biTrim)
biTrimcounts <- apply(biTrimdt,2,function(e) sum(e))
biTriminstances <- sum(biTrimcounts)
biretained <- biTriminstances/biinstances
paste("Total Bigrams: ", length(bicounts))
paste("Bigrams retained: ", length(biTrimcounts))
paste("Ratio of unique bigrams retained: ", round(length(biTrimcounts)/length(bicounts),3))
paste("Ratio of total bigram instances retained: ", round(biretained,3))

hist(bicounts,breaks=3000,xlim=c(0,50),ylim=c(0,1000),xlab="Bigram Counts",main="Histogram of Bigram Counts")
topfeatures(biTrim)

# Trigrams
tkns_trigrams <- tokens_ngrams(tkns, 3)
myDfm3 <- dfm(tkns_trigrams)
myDfm3 <- dfm_sort(myDfm3)
rm(tkns_trigrams)
tridf <- as.data.frame(as.matrix(myDfm3))

# Counts of trigram tokens and total trigram instances
tricounts <- apply(tridf,2,function(e) sum(e))
triinstances <- sum(tricounts)

# Trim trigrams with count < 2
triTrim <- dfm_trim(myDfm3,min_count=2)
triTrimdf <- as.data.frame(as.matrix(triTrim))
triTrimcounts <- apply(triTrimdf,2,function(e) sum(e))
triTriminstances <- sum(triTrimcounts)
triretained <- triTriminstances/triinstances
paste("Total Trigrams: ", length(tricounts))
paste("Trigrams retained: ", length(triTrimcounts))
paste("Ratio of unique trigrams retained: ", round(length(triTrimcounts)/length(tricounts),3))
paste("Ratio of total trigrams retained: ", round(triretained,3))

hist(tricounts,breaks=300,xlim=c(0,50),ylim=c(0,1000),xlab="Trigram Counts",main="Histogram of Trigram Counts")
topfeatures(triTrim)

Scott D. Koenigsman

January 21, 2017
