The Data Science Capstone project involves developing a model to predict the next word given a sequence of text.
The first stage of the project is exploratory in nature. A corpus of text data provided by SwiftKey is loaded into R and some basic exploratory analysis is performed. Once the basic corpus is explored, the data is converted into document-feature matrices (the dfm objects of the quanteda package, which record how often each feature occurs) for words, bigrams (pairs of consecutive words), and trigrams (triplets of consecutive words). These will provide the basis for a preliminary ngram prediction model.
All code used to generate the tables and graphs in this report is included in the appendix.
The initial exploratory analysis and tokenization make use of the tm and quanteda R packages.
The SwiftKey data can be downloaded from the capstone website. The data includes large text datasets in several languages. For the purposes of this project, only the English data is used.
First, the entire SwiftKey English corpus is loaded. A basic summary is produced to show the scope of the data.
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences author datetimestamp description
## text1 389796 43336963 2076869 <NA> 2017-02-15 17:35:14 <NA>
## text2 328192 40948593 1868560 <NA> 2017-02-15 17:35:14 <NA>
## text3 516136 36985902 2578053 <NA> 2017-02-15 17:35:14 <NA>
## heading id language origin
## <NA> en_US.blogs.txt en_US <NA>
## <NA> en_US.news.txt en_US <NA>
## <NA> en_US.twitter.txt en_US <NA>
##
## Source: Converted from tm VCorpus 'en'
## Created: Wed Feb 15 09:36:33 2017
## Notes:
The SwiftKey data is large (roughly 120M tokens and 6.5M sentences), so much so that manipulating it strains computers with limited RAM.
Instead of using the full dataset, random samples are generated for each of the three data files, resulting in files that are approximately 10% of the original size. A corpus is created from the sampled data files. A basic summary follows.
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences author datetimestamp description
## text1 118995 4344996 207873 <NA> 2017-02-15 17:45:58 <NA>
## text2 112856 4094457 186472 <NA> 2017-02-15 17:45:58 <NA>
## text3 132791 3712354 259333 <NA> 2017-02-15 17:45:58 <NA>
## heading id language origin
## <NA> sample.blogs.txt en <NA>
## <NA> sample.news.txt en <NA>
## <NA> sample.twitter.txt en <NA>
##
## Source: Converted from tm VCorpus 'en'
## Created: Wed Feb 15 09:46:01 2017
## Notes:
The sampled corpus contains about 12M total tokens, compared to the full corpus size of 120M tokens.
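The sampling step described above can be illustrated with a minimal base R sketch. The report calls a helper named sampleFiles() whose definition is not shown, so the function below (sample_lines, a hypothetical name) is only an assumption about how such a helper might work: it keeps each line independently with a fixed probability and reads the input in chunks so that the whole file never has to sit in memory.

```r
# Hypothetical helper (not the report's sampleFiles): copy each line of
# `infile` to `outfile` with probability `prob`, reading in chunks.
sample_lines <- function(infile, outfile, prob, chunk_size = 10000L) {
  rcon <- file(infile, "rt")
  wcon <- file(outfile, "wt")
  on.exit({ close(rcon); close(wcon) })
  kept <- 0L
  repeat {
    lines <- readLines(rcon, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break            # end of file reached
    # One Bernoulli draw per line decides whether the line is kept
    keep <- rbinom(length(lines), size = 1, prob = prob) == 1
    writeLines(lines[keep], wcon)
    kept <- kept + sum(keep)
  }
  invisible(kept)
}

# Demonstration on a small temporary file of 1000 made-up lines
set.seed(1234)
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:1000), infile)
n_kept  <- sample_lines(infile, outfile, prob = 0.1)
sampled <- readLines(outfile)
```

Because each line is drawn independently, the sampled file is only approximately 10% of the original, which is consistent with the token counts in the summaries above.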
(Note that at this stage of the project, the data has not yet been split into training, validation and test sets, as no model building/evaluation has taken place. Reserving test/validation data will occur in the next phase of the project.)
A document-feature matrix is created from the corpus. Profanity, punctuation, symbols, numbers, and hashtags are removed from the corpus.
Stopwords (short, common words such as “the”) are also removed from the data. The motivation is that predicting short, common words that are easy to type offers little benefit to the user. Because stopwords are so common and the model will never predict one, its overall accuracy may be significantly worse, depending on how accuracy is measured. This will hopefully be offset by a higher utility to the user. The decision to include or exclude stopwords can be revisited once a preliminary model has been evaluated.
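As a rough illustration of the cleaning steps just described, the following base R sketch lowercases a sentence, strips hashtags, numbers, punctuation, and symbols, and drops stopwords and blocked words. The report itself uses quanteda for this; the function clean_tokens and the tiny stop_list and block_list vectors are hypothetical stand-ins, not the lists used in the analysis.

```r
# Hypothetical, very small word lists for illustration only
stop_list  <- c("the", "a", "to", "of", "and", "is", "in", "i")
block_list <- c("darn")

# Tokenize one string and drop unwanted tokens (ASCII letters only)
clean_tokens <- function(text, remove = character(0)) {
  text <- tolower(text)
  text <- gsub("#\\w+", " ", text)        # drop hashtags
  text <- gsub("[^a-z' ]", " ", text)     # drop digits, punctuation, symbols
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]        # discard empty strings
  tokens[!tokens %in% remove]             # drop stopwords and blocked words
}

tokens <- clean_tokens(
  "I went to the #beach in 2017, and the darn water is cold!",
  remove = c(stop_list, block_list)
)
print(tokens)
```

Only "went", "water", and "cold" survive here, which shows how aggressively stopword removal thins out ordinary text.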
The dataset can be further reduced by dropping low-frequency words. Removing words that occur fewer than 10 times drops roughly 85% of the unique words but still retains approximately 94% of all word instances.
## [1] "Total words: 209702"
## [1] "Words retained: 30942"
## [1] "Ratio of unique words retained: 0.148"
## [1] "Ratio of total word instances retained: 0.938"
## will just said one like can get time new good
## 31589 30706 30477 29039 27312 24837 22818 21428 19510 18077
After removing stopwords, profanity, and words with counts below 10, a histogram of word counts is generated. There are over 3000 words that occur 10 times, about 500 words that occur 20 times, nearly 300 that occur 30 times, and approximately 200 that occur 40 times. The number of words at each count continues to fall, and the distribution becomes sparse at the high-count end. The most frequently occurring words are summarized using the topfeatures() function from the quanteda package.
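The retention ratios and the shape of the count distribution reduce to simple arithmetic on a vector of feature counts. The sketch below uses made-up toy counts (not the report's data) to show the two ratios and a frequency-of-frequencies table, which is the quantity the histogram displays. The last line recomputes one reported ratio from the totals printed above.

```r
# Toy feature counts (made-up numbers for illustration)
counts <- c(common = 50, frequent = 12, borderline = 10,
            rare1 = 3, rare2 = 1, rare3 = 1)

min_count <- 10
trimmed   <- counts[counts >= min_count]   # keep features seen at least 10 times

# Share of distinct features kept, and share of all occurrences kept
unique_ratio   <- length(trimmed) / length(counts)
instance_ratio <- sum(trimmed) / sum(counts)

# Frequency of frequencies: how many features occur exactly k times
freq_of_freq <- table(counts)

# Cross-check of a reported figure: 30942 words retained out of 209702
reported_unique_ratio <- round(30942 / 209702, 3)
```

In the toy data half of the distinct features are dropped, yet most occurrences are retained, which mirrors the pattern in the report: a long tail of rare words accounts for a small share of the text.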
A similar procedure can be followed to generate lists of bigrams and trigrams. Here the tokenization must be performed explicitly, and stopwords and profanity removed, before generating the document-feature matrices for bigrams and trigrams (see code for details). Singletons, that is, bigrams and trigrams occurring only once, are removed. Because the bigrams and trigrams are much more diverse than the single words, the instance retention rates are only 46% and 5.6% respectively.
## [1] "Total Bigrams: 3543352"
## [1] "Bigrams retained: 561642"
## [1] "Ratio of unique bigrams retained: 0.159"
## [1] "Ratio of total bigram instances retained: 0.459"
## right_now new_york last_year last_night
## 2596 1981 1838 1655
## high_school feel_like years_ago last_week
## 1456 1346 1325 1288
## first_time looking_forward
## 1229 1180
## [1] "Total Trigrams: 5310281"
## [1] "Trigrams retained: 112222"
## [1] "Ratio of unique trigrams retained: 0.021"
## [1] "Ratio of total trigram instances retained: 0.056"
## new_york_city let_us_know happy_new_year
## 266 255 200
## happy_mothers_day happy_mother's_day president_barack_obama
## 179 166 158
## cinco_de_mayo new_york_times two_years_ago
## 127 126 121
## looking_forward_seeing
## 115
The bigram and trigram histograms show that the vast majority of the ngrams occur only a handful of times. There are about 400 bigrams that occur 30 times, and about 100 that occur 50 times. The most frequent bigram (“right now”) occurs 2596 times. There are fewer than 400 trigrams that occur 10 times, and fewer than 100 that occur 20 times. The most frequent trigram (“new york city”) occurs only 266 times.
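The construction of bigrams and trigrams from a cleaned token sequence can be sketched in base R as follows. The report uses quanteda for this step; make_ngrams is a hypothetical helper written only to show the idea, and the token vector is made up.

```r
# Hypothetical helper: join each run of n consecutive tokens with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens   <- c("happy", "new", "year", "happy", "new", "york")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Count the bigrams and drop singletons (count < 2), as in the report
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_kept   <- bigram_counts[bigram_counts >= 2]
```

Note that when stopwords are removed before ngrams are formed, an ngram can join words that were not adjacent in the original text, which would explain entries such as looking_forward_seeing in the trigram list.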
The next step in developing an ngram-based prediction model is to smooth the ngram probabilities to account for unseen ngrams. This smoothing can take several forms, from simple approaches (adding 1 to the count of every ngram) to more complex ones, such as Good-Turing discounting or Kneser-Ney smoothing. The authors of the course have also suggested investigating a Katz back-off model. One or more of these will be implemented in the next phase of the project. Once the probabilities of the bigrams and trigrams are determined, they can be used to predict the most likely next words in a sequence. For example, given two words, the model finds the most probable trigrams beginning with those words; if no matching trigrams exist, it backs off to bigrams, and so on.
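To make the planned approach concrete, here is a minimal base R sketch of two of the ideas mentioned: add-one smoothing of a bigram probability, and a lookup that backs off from trigrams to bigrams to single words. This is a simplified illustration under stated assumptions, not Katz back-off (no discounting or back-off weights are applied), and all counts, names, and functions are made up for the example. The add-one estimate used is (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size.

```r
# Toy count tables keyed by "_"-joined ngrams (made-up numbers)
unigram_counts <- c(new = 5, york = 3, city = 2, year = 2, happy = 2)
bigram_counts  <- c(new_york = 3, york_city = 2, new_year = 2, happy_new = 2)
trigram_counts <- c(new_york_city = 2, happy_new_year = 2)

# Add-one (Laplace) smoothed estimate of P(w2 | w1)
laplace_bigram_prob <- function(w1, w2) {
  V   <- length(unigram_counts)                 # vocabulary size
  key <- paste(w1, w2, sep = "_")
  c12 <- if (key %in% names(bigram_counts)) bigram_counts[[key]] else 0
  c1  <- if (w1 %in% names(unigram_counts)) unigram_counts[[w1]] else 0
  (c12 + 1) / (c1 + V)
}

# Words that follow `prefix` in a count table, most frequent first
continuations <- function(prefix, counts) {
  pat  <- paste0(prefix, "_")
  hits <- counts[startsWith(names(counts), pat)]
  names(hits) <- substring(names(hits), nchar(pat) + 1)
  sort(hits, decreasing = TRUE)
}

# Simple back-off lookup: trigrams, then bigrams, then the top unigram
predict_next <- function(w1, w2) {
  hits <- continuations(paste(w1, w2, sep = "_"), trigram_counts)
  if (length(hits) == 0) hits <- continuations(w2, bigram_counts)
  if (length(hits) == 0) hits <- sort(unigram_counts, decreasing = TRUE)
  names(hits)[1]
}

p_seen   <- laplace_bigram_prob("new", "york")  # bigram present in the table
p_unseen <- laplace_bigram_prob("new", "city")  # unseen bigram, still nonzero
guess_tri <- predict_next("new", "york")        # matched by a trigram
guess_bi  <- predict_next("big", "york")        # backs off to bigrams
guess_uni <- predict_next("foo", "bar")         # backs off to unigrams
```

The unseen bigram receives a small but nonzero probability, which is the purpose of smoothing; a full implementation would also need to redistribute probability mass consistently across the back-off levels.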
Following are code segments used in the production of this report.
library(tm)
library(quanteda)
library(data.table)
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the zip file if it does not exist
if (!file.exists("./swiftkey.zip")) {
  download.file(fileUrl, "./swiftkey.zip", method="curl")
}
# Data is unpacked to a directory called "final"; if it does not exist, then unzip
if (!file.exists("final")) {
  unzip("./swiftkey.zip")
}
# English directory
en_dir <- "./final/en_US"
# Load corpus
en <- VCorpus(DirSource(en_dir, encoding="UTF-8"),readerControl = list(language="en_US"))
#convert VCorpus to a quanteda corpus and summarize
enc <- corpus(en)
rm(en)
summary(enc)
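# Write a ~10% random sample of each source file to ./samples.
# sampleFiles() is a helper whose definition is not included in this report;
# it is called with a read connection, a write connection, and a rate of 0.1.
# Ensure the output directory exists (assumed not to be created elsewhere)
if (!dir.exists("./samples")) {
  dir.create("./samples")
}
if (!file.exists("./samples/sample.blogs.txt")) {
  rcon <- file("./final/en_US/en_US.blogs.txt", "rt")
  wcon <- file("./samples/sample.blogs.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}
if (!file.exists("./samples/sample.news.txt")) {
  rcon <- file("./final/en_US/en_US.news.txt", "rt")
  wcon <- file("./samples/sample.news.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}
if (!file.exists("./samples/sample.twitter.txt")) {
  rcon <- file("./final/en_US/en_US.twitter.txt", "rt")
  wcon <- file("./samples/sample.twitter.txt", "wt")
  sampleFiles(rcon,wcon,0.1)
}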
# Sample directory
sample_dir <- "./samples"
en <- VCorpus(DirSource(sample_dir, encoding="UTF-8"),readerControl = list(language="en"))
#convert VCorpus to a quanteda corpus and summarize
enc <- corpus(en)
rm(en)
summary(enc)
#Get list of profanity words, remove trailing commas and get rid of header info
prof <- read.csv("./Terms-to-Block.csv")
prof <- apply(prof, 2, function(e) gsub(',','',e))
prof <- prof[4:726,2]
# Use dfm() to get word frequencies, removing stopwords, profanity, punctuation, symbols, numbers, hashtags
wordDfm <- dfm(enc, ngram=1, remove = c(stopwords("english"),prof), removePunct=TRUE, removeSymbols=TRUE, removeNumbers=TRUE, removeTwitter=TRUE)
wordDfm <- dfm_sort(wordDfm)
worddt <- as.data.table(wordDfm)
# Counts of individual tokens and total word instances
counts <- apply(worddt,2,function(e) sum(e))
instances <- sum(counts)
# Trim low frequency words
wordTrim <- dfm_trim(wordDfm,min_count=10)
wordTrimdt <- as.data.table(wordTrim)
trimcounts <- apply(wordTrimdt,2,function(e) sum(e))
triminstances <- sum(trimcounts)
retained <- triminstances/instances
paste("Total words: ", length(counts))
paste("Words retained: ", length(trimcounts))
paste("Ratio of unique words retained: ", round(length(trimcounts)/length(counts),3))
paste("Ratio of total word instances retained: ", round(retained,3))
# create histogram of word counts and list top word features
hist(trimcounts,breaks=25000,xlim=c(10,50),ylim=c(0,3500),xlab="Word Counts",main="Histogram of Word Counts")
topfeatures(wordTrim)
#Need to tokenize and remove stopwords and profanity before dfm
tkns <- tokenize(tolower(enc), removePunct=TRUE, removeSymbols=TRUE, removeNumbers=TRUE, removeTwitter=TRUE)
tkns <- removeFeatures(tkns, c(stopwords("english"),prof))
tkns_bigrams <- tokens_ngrams(tkns, 2)
myDfm2 <- dfm(tkns_bigrams)
myDfm2 <- dfm_sort(myDfm2)
rm(tkns_bigrams)
bidf <- as.data.frame(as.matrix(myDfm2))
# Counts of bigram tokens and total bigram instances
bicounts <- apply(bidf,2,function(e) sum(e))
biinstances <- sum(bicounts)
# Trim bigrams with count < 2
biTrim <- dfm_trim(myDfm2,min_count=2)
biTrimdt <- as.data.table(biTrim)
biTrimcounts <- apply(biTrimdt,2,function(e) sum(e))
biTriminstances <- sum(biTrimcounts)
biretained <- biTriminstances/biinstances
paste("Total Bigrams: ", length(bicounts))
paste("Bigrams retained: ", length(biTrimcounts))
paste("Ratio of unique bigrams retained: ", round(length(biTrimcounts)/length(bicounts),3))
paste("Ratio of total bigram instances retained: ", round(biretained,3))
hist(bicounts,breaks=3000,xlim=c(0,50),ylim=c(0,1000),xlab="Bigram Counts",main="Histogram of Bigram Counts")
topfeatures(biTrim)
# Trigrams
tkns_trigrams <- tokens_ngrams(tkns, 3)
myDfm3 <- dfm(tkns_trigrams)
myDfm3 <- dfm_sort(myDfm3)
rm(tkns_trigrams)
tridf <- as.data.frame(as.matrix(myDfm3))
# Counts of trigram tokens and total trigram instances
tricounts <- apply(tridf,2,function(e) sum(e))
triinstances <- sum(tricounts)
# Trim trigrams with count < 2
triTrim <- dfm_trim(myDfm3,min_count=2)
triTrimdf <- as.data.frame(as.matrix(triTrim))
triTrimcounts <- apply(triTrimdf,2,function(e) sum(e))
triTriminstances <- sum(triTrimcounts)
triretained <- triTriminstances/triinstances
paste("Total Trigrams: ", length(tricounts))
paste("Trigrams retained: ", length(triTrimcounts))
paste("Ratio of unique trigrams retained: ", round(length(triTrimcounts)/length(tricounts),3))
paste("Ratio of total trigram instances retained: ", round(triretained,3))
hist(tricounts,breaks=300,xlim=c(0,50),ylim=c(0,1000),xlab="Trigram Counts",main="Histogram of Trigram Counts")
topfeatures(triTrim)