We are creating an interactive web app called “Word! - the next word generator” using the Shiny package in the R programming language. For any word or phrase of any length entered by the user, the app will predict and return up to three possible next-word options. For example, if the user enters the words what a wonderful, the app might return up to three options such as day, world, and person.
The unique features of our app:
1. User can input text of any length, including numbers and punctuation
2. Even the rarest word or phrase will receive a fairly appropriate response
3. Light on computational resources - no large storage, no heavy calculations
This is an interim milestone report to:
1. Demonstrate we have downloaded and read the text data that will form the vocabulary basis for our word prediction
2. Display summary statistics for the data
3. Report interesting findings in the data
4. Share our approach for the final app
NOTE: Please refer to the APPENDIX at the end of the report for the complete code of the various functions used.
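The code in this report also assumes the following packages (all listed in the session information later in this report) are attached:
library(NLP)     # words(), as.String(), annotate(), Annotation()
library(tm)      # Corpus(), VectorSource(), tm_map() and the text-cleaning transformations
library(RWeka)   # NGramTokenizer() and Weka_control()
library(openNLP) # Maxent word-token and POS-tag annotators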
The data is downloaded from the Coursera site provided to us and read into data frames for further analysis. The data are excerpts from blogs, news, and tweets in the English language and form the basis for our word prediction model.
fileURL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, "projectdata.zip")
unzip("projectdata.zip")
blogdata = readLines("final/en_US/en_US.blogs.txt")
newsdata = readLines("final/en_US/en_US.news.txt")
twitterdata = readLines("final/en_US/en_US.twitter.txt")
Let’s take a quick look at the file sizes (in MB) and the number of lines and words in the text data we have read.
# create dataframe to hold information on the datasets
summarydetails = data.frame(dataset = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
# file size in megabytes
summarydetails$filesize.mb = c(round(file.size("final/en_US/en_US.blogs.txt")/1048576, 1),
                               round(file.size("final/en_US/en_US.news.txt")/1048576, 1),
                               round(file.size("final/en_US/en_US.twitter.txt")/1048576, 1))
# number of lines
summarydetails$num.lines = c(length(blogdata), length(newsdata), length(twitterdata))
# number of words (words() splits each line into its component words)
summarydetails$num.words[1] = sum(vapply(blogdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
summarydetails$num.words[2] = sum(vapply(newsdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
summarydetails$num.words[3] = sum(vapply(twitterdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
# show details
summarydetails
## dataset filesize.mb num.lines num.words
## 1 en_US.blogs.txt 200.4 899288 37334131
## 2 en_US.news.txt 196.3 77259 2643969
## 3 en_US.twitter.txt 159.4 2360148 30373543
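Summing across the three files gives the combined totals (a quick check derived from the table above):
# combined line and word counts across the three files
colSums(summarydetails[, c("num.lines", "num.words")])
# roughly 3.34 million lines and 70.4 million words in total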
With these file sizes being very large and our computational resources rather limited, we will use only 1% of all the data to build our prediction model. All further exploratory analysis will also be based on this 1% sample.
combineddata = c(blogdata, newsdata, twitterdata)
rm("blogdata"); rm("newsdata"); rm("twitterdata") #to release memory
set.seed(321321) #for reproducibility
combineddata = sample(combineddata) # shuffle the combined data
# pull a ~1% random sample by flagging each line with probability p
p = 0.01
rowtrain = which(rbinom(n = length(combineddata), size = 1, prob = p)==1)
trainset = combineddata[rowtrain]
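As a rough sanity check (the exact count varies with the random draw), the training sample should hold about 1% of the ~3.34 million combined lines:
# the sample should contain roughly 1% of the combined lines
length(trainset) / length(combineddata)   # expected to be close to p = 0.01
length(trainset)                          # roughly 33,000 lines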
We will now explore which single words and which phrases of two and three words are the most common in our dataset. For this purpose, we create the requisite n-grams. That is, we parse all the text into component words (unigrams) and phrases of two and three consecutive words (bigrams and trigrams, respectively), and then count the frequency of occurrence of these n-grams across our entire library of text data.
# convert text into Corpus form
traincorpus = Corpus(VectorSource(trainset)); rm("trainset")
# we will clean the corpus, including removing profanities and other undesired elements
fileURL = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(fileURL, "profanity.txt")
profanity = readLines("profanity.txt")
traincorpus = cleancorpus(traincorpus)
# create n-grams, as well as clean and prune them to remove "non-words" and lighten memory requirements
# unigram
n1gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 1, max = 1)))); n1gramfreq = processgram(n1gramfreq)
# bigram
n2gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 2, max = 2)))); n2gramfreq = processgram(n2gramfreq); n2gramfreq = prunegram(n2gramfreq)
# trigram
n3gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 3, max = 3)))); n3gramfreq = processgram(n3gramfreq); n3gramfreq = prunegram(n3gramfreq)
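A quick look at the resulting vocabulary size (the exact figure depends on the random sample; it corresponds to the 43,000+ unique words quoted below):
# number of unique words (unigrams) retained after cleaning
nrow(n1gramfreq)   # 43,000+ in our 1% sample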
From our unigram evaluation, we found that just 111 words, or about 0.25% of the 43,000+ unique words in the vocabulary, account for 50% of coverage. That is, 50% of everything people write in this corpus is covered by just these 111 words. Below we see the percentage coverage offered by each word and the cumulative percentage coverage offered by every successive word. For instance, the word the covers 4.3% of all word occurrences. Add the word to and the two words cumulatively account for 7.1%.
# percentage coverage by each word
n1gramfreq$probability = n1gramfreq$V1 / sum(n1gramfreq$V1)
# cumulative percentage coverage
n1gramfreq$cumulative = cumsum(n1gramfreq$probability)
## V1 V2 V3 probability cumulative
## 1 29880 the 0.04320092 0.04320092
## 2 19228 to 0.02780011 0.07100102
## 3 17453 i 0.02523379 0.09623481
## 4 16244 and 0.02348580 0.11972061
## 5 16047 a 0.02320097 0.14292158
## 6 12939 of 0.01870738 0.16162897
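The 111-word figure quoted above can be read straight off the cumulative column (a sketch, assuming the table is ordered by descending frequency as in the output above):
# first position at which cumulative coverage reaches 50%
which(n1gramfreq$cumulative >= 0.5)[1]                     # 111 words in our sample
which(n1gramfreq$cumulative >= 0.5)[1] / nrow(n1gramfreq)  # about 0.25% of the vocabulary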
Our algorithm relies on the Stupid Backoff mechanism with unknown-word handling based on part-of-speech (POS) tagging. Described simply:
We use a database of n-grams of order two, three, four, and five built on 1% of the blog, news, and tweet data offered. Each n-gram is broken into two parts: a “search phrase” and a “next word”. However many words the user enters as input, we take up only the last four words. We also remove any numbers and punctuation, except intra-word apostrophes and hyphens.
Backoff: We then search for these four words among the 5-gram search phrases and return the top three next words. If not found, we search for the last three words among the 4-gram search phrases and return the appropriate next words. This continues until only the very last word is left, which is searched for among the bigrams. If still not found, we resort to:
Unknown handling with POS tagging: Using our knowledge of English grammar and sentence construction from the Cambridge Dictionary, and a list of the most frequently used English words from Wikipedia, we have constructed a list of up to three most frequent words to follow each particular POS. When faced with an unknown last remaining word from the user, that word is tagged as a noun, verb, or other POS as per Penn Treebank tagging. Against that tag, we then return the appropriate next words from our prepared list.
For example, in the case of the trigram how are you, our database stores it as the search phrase how are and the next word you. If the user enters just wanted to come over and ask how are, we take only and ask how are. We search for this among the 5-grams: not found. We search for ask how are among the 4-grams: not found. We search for how are among the 3-grams - found! - and return you. Suppose it wasn’t that easy, and the last remaining word was something not already in our corpus, such as telescope: we would run it through our POS tagger, find that it is a noun, and then return up to three words most likely to follow a noun from our prepared list.
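The same walk-through can be sketched with the helper functions from the appendix (illustrative only; the exact matches depend on which lines end up in the 1% sample):
# step-by-step backoff for "just wanted to come over and ask how are"
input = cleaninput("just wanted to come over and ask how are")
input                                # only the last four words are kept: "and" "ask" "how" "are"
searchxiny(input,      n2to5gram)    # 5-gram lookup of "and ask how are": no match in this example
searchxiny(input[2:4], n2to5gram)    # 4-gram lookup of "ask how are": no match
searchxiny(input[3:4], n2to5gram)    # 3-gram lookup of "how are": returns "you" (and possibly others)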
head(n2to5gram)
## V2 V3
## 1 a great
## 2 a few
## 3 a good
## 4 a-town veteran
## 5 aaron rodgers
## 6 aaron rogers
head(unknown)
## V1 V2
## 1 CC the
## 2 CC a
## 3 CC then
## 4 CD of
## 5 CD years
## 6 CD days
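Continuing the telescope example, the unknown-handling path looks roughly like this (a sketch; the exact tag returned depends on the openNLP model):
# tag the unknown last word with its Penn Treebank part of speech
pos = tagPOS("telescope")    # expected to be "NN" (singular noun)
# look the tag up in the prepared list of most likely followers
searchxiny(pos, unknown)     # up to three words commonly following a noun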
# Highly unlikely search input including punctuation and numbers
input = "Kids, guess what?! @8am, Teletubbies"
print(paste("Your input is:", input))
## [1] "Your input is: Kids, guess what?! @8am, Teletubbies"
start = Sys.time(); print(paste("Your next word option(s):", paste(givenext(input), collapse = ", "))); end = Sys.time(); print(paste("Time taken in seconds:", lapse(end, start)))
## [1] "Your next word option(s): are, of, and"
## [1] "Time taken in seconds: 7"
# More common input
input = "How are you"
print(paste("Your input is:", input))
## [1] "Your input is: How are you"
start = Sys.time(); print(paste("Your next word option(s):", paste(givenext(input), collapse = ", "))); end = Sys.time(); print(paste("Time taken in seconds:", lapse(end, start)))
## [1] "Your next word option(s): doing"
## [1] "Time taken in seconds: 0"
# Input with no words
input = "@8!!"
print(paste("Your input is:", input))
## [1] "Your input is: @8!!"
givenext(input)
## [1] "** Input should include at least one word **"
We saw above that, time-wise, the performance is rather good even for unknown words. Also, in terms of memory requirements, the n-gram database to be loaded and the list for unknowns are very light compared with the 550+ MB of initial text data. This has been achieved by pruning the n-grams. That is, we retain only the top three most frequently occurring next-word options for every unique search phrase, and we remove all search phrase - next word combinations with a frequency of occurrence below the threshold of 2.
print(object.size(n2to5gram), units="Mb")
## 7.1 Mb
print(object.size(unknown), units="Kb")
## 5.6 Kb
Accuracy can be gauged through a benchmarking method available on GitHub. Testing on 10 tweet lines and 10 blog lines, we find:
## [1] FALSE
## [1] FALSE
benchmark(givenext, sent.list = list('tweets' = tweets, 'blogs' = blogs), ext.output = T)
## Overall top-3 score: 14.09 %
## Overall top-1 precision: 11.24 %
## Overall top-3 precision: 16.55 %
## Average runtime: 321.56 msec
## Number of predictions: 467
## Total memory used: 42.94 MB
##
## Dataset details
## Dataset "blogs" (10 lines, 274 words, hash 1049b666a42154ed42c5f4541c0ffe970b3e0123adf828c22d3e6e228f9d1558)
## Score: 14.86 %, Top-1 precision: 11.99 %, Top-3 precision: 17.60 %
## Dataset "tweets" (10 lines, 201 words, hash 3a03ed470dcc55538d1190d87c20eb4aa9d1f7d1970a8f356b9ff910788a7b7c)
## Score: 13.33 %, Top-1 precision: 10.50 %, Top-3 precision: 15.50 %
##
##
## R version 3.5.2 (2018-12-20), platform x86_64-w64-mingw32/x64 (64-bit)
## Attached non-base packages: data.table (v1.11.8), stringi (v1.2.4), digest (v0.6.18), openNLP (v0.2-6), RWeka (v0.4-40), tm (v0.7-6), NLP (v0.2-0)
## Unattached non-base packages: Rcpp (v1.0.0), slam (v0.1-45), grid (v3.5.2), magrittr (v1.5), evaluate (v0.12), RWekajars (v3.9.3-1), openNLPdata (v1.5.3-4), xml2 (v1.2.0), rmarkdown (v1.11), tools (v3.5.2), stringr (v1.3.1), xfun (v0.4), yaml (v2.2.0), parallel (v3.5.2), compiler (v3.5.2), rJava (v0.9-10), htmltools (v0.3.6), knitr (v1.21)
Performance in terms of time and accuracy can be improved, while keeping memory requirements reasonable, by expanding the size of our search database. By finding more search phrases in the database itself, the algorithm can avoid the time-consuming POS tagging step. Results may also be more accurate with greater context involved than just POS information.
For this purpose, we could build an expanded database from scratch from our given news, blog, and tweet text, but this would consume too many resources. An easier alternative would be to use the free, ready-made n-grams available from the Corpus of Contemporary American English (COCA). It may be beneficial to include COCA’s 2-gram data to bolster our final search tier (the bigram lookup) and so ensure fewer visits to the unknown-handling segment.
# clean corpus content
cleancorpus <- function(x){
# remove non-ASCII characters such as פö¼
x$content = iconv(x$content, from = "UTF-8", to = "ASCII", sub = "")
# convert all text to lower case, remove numbers
x = tm_map(x, content_transformer(tolower))
x = tm_map(x, removeNumbers)
#remove twitter handles, hashtags and URLs (http, https, ftp)
x$content = gsub(" @\\S*", " ", x$content)
x$content = gsub(" #\\S*", " ", x$content)
x$content = gsub("(f|ht)(tp)(s?)(://)(\\S*)", " ", x$content)
# remove punctuation (preserving apostrophes and intra-word hyphens), profanity, extra white spaces
# we do not intend to stem the words or remove stopwords
x = tm_map(x, removePunctuation, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE)
x = tm_map(x, removeWords, profanity)
x = tm_map(x, stripWhitespace)
x
}
processgram <- function(x){
# split each string into component words
if(ncol(x)==2){x <- data.frame(x$Freq, do.call(rbind, strsplit(as.character(x$Var1), split = " ")))}
# clean n-gram to remove "words" that are:
# only single letters such as "s", "m" (except "a" and "i", which are words in themselves)
# starting with hyphen
# repetition of single letters such as "aaaa", "bb"
# "ve"
for(i in 2:ncol(x)){
drop <- c(which(x[,i] %in% letters[-c(1,9)]), grep("^-|^([a-z])\\1+$|^ve$", x[,i]))
if(length(drop) > 0){x <- x[-drop,]} # guard: an empty drop index would otherwise remove every row
}
# paste together all columns except the last to make the "search phrase"
# the last column will form the "next word"
x = data.frame(V1 = x[,1], V2 = apply(as.data.frame(x[,-c(1,ncol(x))]), 1, paste, collapse = " "), V3 = x[,ncol(x)])
# convert non-frequency columns to character
for(i in 2:3){x[,i] = as.character(x[,i])}
# sort
x = x[order(x$V2, -x$V1),]
# return
x
}
# prune n-gram to lighten memory requirement by:
# retaining only top-3 "next word" for any unique value of search,
# removing entries with frequency of just 1, and dropping frequency column
prunegram <- function(x){
x = x[ave(x$V1, x$V2, FUN = seq_along) <= 3, ]
x[which(x$V1>1),c(2,3)]
}
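A toy illustration of prunegram() on hypothetical bigram rows (assuming, as processgram() guarantees, descending frequency within each search phrase):
# toy data: one search phrase with five candidate next words
toy = data.frame(V1 = c(9, 5, 3, 2, 1),                       # frequency
                 V2 = rep("how are", 5),                      # search phrase
                 V3 = c("you", "we", "they", "things", "ya"), # next word
                 stringsAsFactors = FALSE)
prunegram(toy)
# keeps only the search phrase / next word columns for the top-3 entries
# "you", "we" and "they"; the lower-ranked "things" and the frequency-1 "ya" are dropped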
# we have already read in text data, created a corpus, cleaned it,
# also obtained bigrams and trigrams
# hence only creating 4-grams and 5-grams
n4gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 4, max = 4)))); n4gramfreq = processgram(n4gramfreq); n4gramfreq = prunegram(n4gramfreq)
n5gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 5, max = 5)))); n5gramfreq = processgram(n5gramfreq); n5gramfreq = prunegram(n5gramfreq)
# combine into a single database
n2to5gram = rbind(n2gramfreq, n3gramfreq, n4gramfreq, n5gramfreq)
rm("n2gramfreq"); rm("n3gramfreq"); rm("n4gramfreq"); rm("n5gramfreq")
write.csv(n2to5gram, "n2to5gram.csv", row.names = FALSE)
unknown = read.csv("postags.csv", header = FALSE, stringsAsFactors = FALSE)
# clean the input text to remove numeric
# and punctuation data (except intraword apostrophe and hyphens),
# convert to lower case, and then split into words to search
# if input text is longer than 4 words, retain only last 4 for search
cleaninput <- function(x){
x = gsub("\\d+", "", x)
x = gsub("\\s*(?:(?:\\B[-']+|[-']+\\B|[^-'[:^punct:]]+)\\s*)+", " ", x, perl = TRUE)
x = tolower(x)
x = strsplit(as.character(x), split = " ")[[1]]
x = x[x != ""]
if(length(x) > 4){x <- tail(x, 4)}
x
}
# return up to three possible "next word" options
# using a backoff model
# unknowns handled with a combination of Penn Treebank tagging
# and returning most common possible next words
givenext <- function(x){
x = cleaninput(x)
if(length(x)==0){return("** Input should include at least one word **")}
startpoint = 1
found = NA
seek = NA
#Backoff
while(is.na(found[1]) & startpoint <= length(x)){
seek = x[startpoint:length(x)]
found = searchxiny(seek, n2to5gram)
startpoint = startpoint + 1
}
#Unknown handling
if(is.na(found[1])){
pos = tagPOS(seek)
found <- searchxiny(pos, unknown)
}else{
found <- found[!is.na(found)]
}
found
}
# search for input text in n-gram database
# return "next word" options
searchxiny <- function(x, y){
y[y[,1] == paste(x, collapse = " "), 2]
}
# custom tagPOS() helper built on openNLP annotators; returns the Penn Treebank tag of the last word
tagPOS <- function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
gc() #release memory to avoid Out of Memory error
a3[[length(a3)]]$features[[1]][[1]][1]
}