We are creating an interactive web app called “Word! - the next word generator” using the Shiny package in the R programming language. For any word or phrase of any length entered by the user, the app will predict and return up to three possible next-word options. For example, if the user enters the words what a wonderful, the app might return up to three options such as day, world, and person.
The unique features of our app:
1. User can input text of any length, including numbers and punctuation
2. Even the rarest word or phrase will receive a fairly appropriate response
3. Light on computational resources - no large storage, no heavy calculations
This is an interim milestone report to:
1. Demonstrate we have downloaded and read the text data that will form the vocabulary basis for our word prediction
2. Display summary statistics for the data
3. Report interesting findings in the data
4. Share our approach for the final app
NOTE: Please refer to the APPENDIX at the end of the report for the complete code of the various functions used.
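The code in this report also assumes the following packages (all listed in the session information later in this report) are attached:
library(NLP)     # words(), as.String(), annotate(), Annotation()
library(tm)      # Corpus(), VectorSource(), tm_map() and the text-cleaning transformations
library(RWeka)   # NGramTokenizer() and Weka_control()
library(openNLP) # Maxent word-token and POS-tag annotators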
The data is downloaded from the Coursera site provided to us and read into data frames for further analysis. The data are excerpts from blogs, news, and tweets in the English language and form the basis for our word prediction model.
fileURL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileURL, "projectdata.zip")
unzip("projectdata.zip")
blogdata = readLines("final/en_US/en_US.blogs.txt")
newsdata = readLines("final/en_US/en_US.news.txt")
twitterdata = readLines("final/en_US/en_US.twitter.txt")
Let’s take a quick look at the file sizes (in MB) and the number of lines and words in the text data we have read.
# create dataframe to hold information on the datasets
summarydetails = data.frame(dataset = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
# file size in megabytes
summarydetails$filesize.mb = c(round(file.size("final/en_US/en_US.blogs.txt")/1048576, 1),
                               round(file.size("final/en_US/en_US.news.txt")/1048576, 1),
                               round(file.size("final/en_US/en_US.twitter.txt")/1048576, 1))
# number of lines
summarydetails$num.lines = c(length(blogdata), length(newsdata), length(twitterdata))
# number of words (words() splits each line into its component words)
summarydetails$num.words[1] = sum(vapply(blogdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
summarydetails$num.words[2] = sum(vapply(newsdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
summarydetails$num.words[3] = sum(vapply(twitterdata, function(line) length(words(line)), integer(1), USE.NAMES = FALSE))
# show details
summarydetails
## dataset filesize.mb num.lines num.words
## 1 en_US.blogs.txt 200.4 899288 37334131
## 2 en_US.news.txt 196.3 77259 2643969
## 3 en_US.twitter.txt 159.4 2360148 30373543
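Summing across the three files gives the combined totals (a quick check derived from the table above):
# combined line and word counts across the three files
colSums(summarydetails[, c("num.lines", "num.words")])
# roughly 3.34 million lines and 70.4 million words in total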
With these file sizes being very large and our computational resources rather limited, we will use only 1% of all the data to build our prediction model. All further exploratory analysis will also be based on this 1% sample.
combineddata = c(blogdata, newsdata, twitterdata)
rm("blogdata"); rm("newsdata"); rm("twitterdata") #to release memory
set.seed(321321) #for reproducibility
combineddata = sample(combineddata) # shuffle the combined data
# pull a ~1% random sample by flagging each line with probability p
p = 0.01
rowtrain = which(rbinom(n = length(combineddata), size = 1, prob = p)==1)
trainset = combineddata[rowtrain]
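As a rough sanity check (the exact count varies with the random draw), the training sample should hold about 1% of the ~3.34 million combined lines:
# the sample should contain roughly 1% of the combined lines
length(trainset) / length(combineddata)   # expected to be close to p = 0.01
length(trainset)                          # roughly 33,000 lines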
We will now explore which single words and which phrases of two and three words are the most common in our dataset. For this purpose, we create the requisite n-grams. That is, we parse all the text into component words (unigrams) and phrases of two and three consecutive words (bigrams and trigrams, respectively), and then count the frequency of occurrence of these n-grams across our entire library of text data.
# convert text into Corpus form
traincorpus = Corpus(VectorSource(trainset)); rm("trainset")
# we will clean the corpus, including removing profanities and other undesired elements
fileURL = "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(fileURL, "profanity.txt")
profanity = readLines("profanity.txt")
traincorpus = cleancorpus(traincorpus)
# create n-grams, as well as clean and prune them to remove "non-words" and lighten memory requirements
# unigram
n1gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 1, max = 1)))); n1gramfreq = processgram(n1gramfreq)
# bigram
n2gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 2, max = 2)))); n2gramfreq = processgram(n2gramfreq); n2gramfreq = prunegram(n2gramfreq)
# trigram
n3gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 3, max = 3)))); n3gramfreq = processgram(n3gramfreq); n3gramfreq = prunegram(n3gramfreq)
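A quick look at the resulting vocabulary size (the exact figure depends on the random sample; it corresponds to the 43,000+ unique words quoted below):
# number of unique words (unigrams) retained after cleaning
nrow(n1gramfreq)   # 43,000+ in our 1% sample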
From our unigram evaluation, we found that just 111 words, or about 0.25% of the 43,000+ unique words in the vocabulary, account for 50% of coverage. That is, 50% of everything people write in this corpus is covered by just these 111 words. Below we see the percentage coverage offered by each word and the cumulative percentage coverage offered by every successive word. For instance, the word the covers 4.3% of all word occurrences. Add the word to and the two words cumulatively account for 7.1%.
# percentage coverage by each word
n1gramfreq$probability = n1gramfreq$V1 / sum(n1gramfreq$V1)
# cumulative percentage coverage
n1gramfreq$cumulative = cumsum(n1gramfreq$probability)
## V1 V2 V3 probability cumulative
## 1 29880 the 0.04320092 0.04320092
## 2 19228 to 0.02780011 0.07100102
## 3 17453 i 0.02523379 0.09623481
## 4 16244 and 0.02348580 0.11972061
## 5 16047 a 0.02320097 0.14292158
## 6 12939 of 0.01870738 0.16162897
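The 111-word figure quoted above can be read straight off the cumulative column (a sketch, assuming the table is ordered by descending frequency as in the output above):
# first position at which cumulative coverage reaches 50%
which(n1gramfreq$cumulative >= 0.5)[1]                     # 111 words in our sample
which(n1gramfreq$cumulative >= 0.5)[1] / nrow(n1gramfreq)  # about 0.25% of the vocabulary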
Our algorithm relies on the Stupid Backoff mechanism with unknown-word handling based on part-of-speech (POS) tagging. Described simply:
We use a database of n-grams of order two, three, four, and five built on 1% of the blog, news, and tweet data offered. Each n-gram is broken into two parts: a “search phrase” and a “next word”. However many words the user enters as input, we take up only the last four words. We also remove any numbers and punctuation, except intra-word apostrophes and hyphens.
Backoff: We then search for these four words among the 5-gram search phrases and return the top three next words. If not found, we search for the last three words among the 4-gram search phrases and return the appropriate next words. This continues until only the very last word is left, which is searched for among the bigrams. If still not found, we resort to:
Unknown handling with POS tagging: Using our knowledge of English grammar and sentence construction from the Cambridge Dictionary, and a list of the most frequently used English words from Wikipedia, we have constructed a list of up to three most frequent words to follow each particular POS. When faced with an unknown last remaining word from the user, that word is tagged as a noun, verb, or other POS as per Penn Treebank tagging. Against that tag, we then return the appropriate next words from our prepared list.
For example, in the case of the trigram how are you, our database stores it as the search phrase how are and the next word you. If the user enters just wanted to come over and ask how are, we take only and ask how are. We search for this among the 5-grams: not found. We search for ask how are among the 4-grams: not found. We search for how are among the 3-grams - found! - and return you. Suppose it wasn’t that easy, and the last remaining word was something not already in our corpus, such as telescope: we would run it through our POS tagger, find that it is a noun, and then return up to three words most likely to follow a noun from our prepared list.
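The same walk-through can be sketched with the helper functions from the appendix (illustrative only; the exact matches depend on which lines end up in the 1% sample):
# step-by-step backoff for "just wanted to come over and ask how are"
input = cleaninput("just wanted to come over and ask how are")
input                                # only the last four words are kept: "and" "ask" "how" "are"
searchxiny(input,      n2to5gram)    # 5-gram lookup of "and ask how are": no match in this example
searchxiny(input[2:4], n2to5gram)    # 4-gram lookup of "ask how are": no match
searchxiny(input[3:4], n2to5gram)    # 3-gram lookup of "how are": returns "you" (and possibly others)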
head(n2to5gram)
## V2 V3
## 1 a great
## 2 a few
## 3 a good
## 4 a-town veteran
## 5 aaron rodgers
## 6 aaron rogers
head(unknown)
## V1 V2
## 1 CC the
## 2 CC a
## 3 CC then
## 4 CD of
## 5 CD years
## 6 CD days
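Continuing the telescope example, the unknown-handling path looks roughly like this (a sketch; the exact tag returned depends on the openNLP model):
# tag the unknown last word with its Penn Treebank part of speech
pos = tagPOS("telescope")    # expected to be "NN" (singular noun)
# look the tag up in the prepared list of most likely followers
searchxiny(pos, unknown)     # up to three words commonly following a noun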
# Highly unlikely search input including punctuation and numbers
input = "Kids, guess what?! @8am, Teletubbies"
print(paste("Your input is:", input))
## [1] "Your input is: Kids, guess what?! @8am, Teletubbies"
start = Sys.time(); print(paste("Your next word option(s):", paste(givenext(input), collapse = ", "))); end = Sys.time(); print(paste("Time taken in seconds:", lapse(end, start)))
## [1] "Your next word option(s): are, of, and"
## [1] "Time taken in seconds: 7"
# More common input
input = "How are you"
print(paste("Your input is:", input))
## [1] "Your input is: How are you"
start = Sys.time(); print(paste("Your next word option(s):", paste(givenext(input), collapse = ", "))); end = Sys.time(); print(paste("Time taken in seconds:", lapse(end, start)))
## [1] "Your next word option(s): doing"
## [1] "Time taken in seconds: 0"
# Input with no words
input = "@8!!"
print(paste("Your input is:", input))
## [1] "Your input is: @8!!"
givenext(input)
## [1] "** Input should include at least one word **"
We saw above that, time-wise, the performance is rather good even for unknown words. Also, in terms of memory requirements, the n-gram database to be loaded and the list for unknowns are very light compared with the 550+ MB of initial text data. This has been achieved by pruning the n-grams. That is, we retain only the top three most frequently occurring next-word options for every unique search phrase, and we remove all search phrase - next word combinations with a frequency of occurrence below the threshold of 2.
print(object.size(n2to5gram), units="Mb")
## 7.1 Mb
print(object.size(unknown), units="Kb")
## 5.6 Kb
Accuracy can be gauged through a benchmarking method available on GitHub. Testing on 10 tweet lines and 10 blog lines, we find:
## [1] FALSE
## [1] FALSE
benchmark(givenext, sent.list = list('tweets' = tweets, 'blogs' = blogs), ext.output = T)
## Overall top-3 score: 14.09 %
## Overall top-1 precision: 11.24 %
## Overall top-3 precision: 16.55 %
## Average runtime: 321.56 msec
## Number of predictions: 467
## Total memory used: 42.94 MB
##
## Dataset details
## Dataset "blogs" (10 lines, 274 words, hash 1049b666a42154ed42c5f4541c0ffe970b3e0123adf828c22d3e6e228f9d1558)
## Score: 14.86 %, Top-1 precision: 11.99 %, Top-3 precision: 17.60 %
## Dataset "tweets" (10 lines, 201 words, hash 3a03ed470dcc55538d1190d87c20eb4aa9d1f7d1970a8f356b9ff910788a7b7c)
## Score: 13.33 %, Top-1 precision: 10.50 %, Top-3 precision: 15.50 %
##
##
## R version 3.5.2 (2018-12-20), platform x86_64-w64-mingw32/x64 (64-bit)
## Attached non-base packages: data.table (v1.11.8), stringi (v1.2.4), digest (v0.6.18), openNLP (v0.2-6), RWeka (v0.4-40), tm (v0.7-6), NLP (v0.2-0)
## Unattached non-base packages: Rcpp (v1.0.0), slam (v0.1-45), grid (v3.5.2), magrittr (v1.5), evaluate (v0.12), RWekajars (v3.9.3-1), openNLPdata (v1.5.3-4), xml2 (v1.2.0), rmarkdown (v1.11), tools (v3.5.2), stringr (v1.3.1), xfun (v0.4), yaml (v2.2.0), parallel (v3.5.2), compiler (v3.5.2), rJava (v0.9-10), htmltools (v0.3.6), knitr (v1.21)
Performance in terms of time and accuracy can be improved, while keeping memory requirements reasonable, by expanding the size of our search database. By finding more search phrases in the database itself, the algorithm can avoid the time-consuming POS tagging step. Results may also be more accurate with greater context involved than just POS information.
For this purpose, we could build an expanded database from scratch from our given news, blog, and tweet text, but this would consume too many resources. An easier alternative would be to use the free, ready-made n-grams available from the Corpus of Contemporary American English (COCA). It may be beneficial to include COCA’s 2-gram data to bolster our final search tier (the bigram lookup) and so ensure fewer visits to the unknown-handling segment.
# clean corpus content
cleancorpus <- function(x){
# remove non-ASCII characters such as פö¼
x$content = iconv(x$content, from = "UTF-8", to = "ASCII", sub = "")
# convert all text to lower case, remove numbers
x = tm_map(x, content_transformer(tolower))
x = tm_map(x, removeNumbers)
#remove twitter handles, hashtags and URLs (http, https, ftp)
x$content = gsub(" @\\S*", " ", x$content)
x$content = gsub(" #\\S*", " ", x$content)
x$content = gsub("(f|ht)(tp)(s?)(://)(\\S*)", " ", x$content)
# remove punctuation (preserving apostrophes and intra-word hyphens), profanity, extra white spaces
# we do not intend to stem the words or remove stopwords
x = tm_map(x, removePunctuation, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE)
x = tm_map(x, removeWords, profanity)
x = tm_map(x, stripWhitespace)
x
}
processgram <- function(x){
# split each string into component words
if(ncol(x)==2){x <- data.frame(x$Freq, do.call(rbind, strsplit(as.character(x$Var1), split = " ")))}
# clean n-gram to remove "words" that are:
# only single letters such as "s", "m" (except "a" and "i", which are words in themselves)
# starting with hyphen
# repetition of single letters such as "aaaa", "bb"
# "ve"
for(i in 2:ncol(x)){
drop <- c(which(x[,i] %in% letters[-c(1,9)]), grep("^-|^([a-z])\\1+$|^ve$", x[,i]))
if(length(drop) > 0){x <- x[-drop,]} # guard: an empty drop index would otherwise remove every row
}
# paste together all columns except the last to make the "search phrase"
# the last column will form the "next word"
x = data.frame(V1 = x[,1], V2 = apply(as.data.frame(x[,-c(1,ncol(x))]), 1, paste, collapse = " "), V3 = x[,ncol(x)])
# convert non-frequency columns to character
for(i in 2:3){x[,i] = as.character(x[,i])}
# sort
x = x[order(x$V2, -x$V1),]
# return
x
}
# prune n-gram to lighten memory requirement by:
# retaining only top-3 "next word" for any unique value of search,
# removing entries with frequency of just 1, and dropping frequency column
prunegram <- function(x){
x = x[ave(x$V1, x$V2, FUN = seq_along) <= 3, ]
x[which(x$V1>1),c(2,3)]
}
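A toy illustration of prunegram() on hypothetical bigram rows (assuming, as processgram() guarantees, descending frequency within each search phrase):
# toy data: one search phrase with five candidate next words
toy = data.frame(V1 = c(9, 5, 3, 2, 1),                       # frequency
                 V2 = rep("how are", 5),                      # search phrase
                 V3 = c("you", "we", "they", "things", "ya"), # next word
                 stringsAsFactors = FALSE)
prunegram(toy)
# keeps only the search phrase / next word columns for the top-3 entries
# "you", "we" and "they"; the lower-ranked "things" and the frequency-1 "ya" are dropped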
# we have already read in text data, created a corpus, cleaned it,
# also obtained bigrams and trigrams
# hence only creating 4-grams and 5-grams
n4gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 4, max = 4)))); n4gramfreq = processgram(n4gramfreq); n4gramfreq = prunegram(n4gramfreq)
n5gramfreq = data.frame(table(NGramTokenizer(traincorpus$content, Weka_control(min = 5, max = 5)))); n5gramfreq = processgram(n5gramfreq); n5gramfreq = prunegram(n5gramfreq)
# combine into a single database
n2to5gram = rbind(n2gramfreq, n3gramfreq, n4gramfreq, n5gramfreq)
rm("n2gramfreq"); rm("n3gramfreq"); rm("n4gramfreq"); rm("n5gramfreq")
write.csv(n2to5gram, "n2to5gram.csv", row.names = FALSE)
unknown = read.csv("postags.csv", header = FALSE, stringsAsFactors = FALSE)
# clean the input text to remove numeric
# and punctuation data (except intraword apostrophe and hyphens),
# convert to lower case, and then split into words to search
# if input text is longer than 4 words, retain only last 4 for search
cleaninput <- function(x){
x = gsub("\\d+", "", x)
x = gsub("\\s*(?:(?:\\B[-']+|[-']+\\B|[^-'[:^punct:]]+)\\s*)+", " ", x, perl = TRUE)
x = tolower(x)
x = strsplit(as.character(x), split = " ")[[1]]
x = x[x != ""]
if(length(x) > 4){x <- tail(x, 4)}
x
}
# return up to three possible "next word" options
# using a backoff model
# unknowns handled with a combination of Penn Treebank tagging
# and returning most common possible next words
givenext <- function(x){
x = cleaninput(x)
if(length(x)==0){return("** Input should include at least one word **")}
startpoint = 1
found = NA
seek = NA
#Backoff
while(is.na(found[1]) & startpoint <= length(x)){
seek = x[startpoint:length(x)]
found = searchxiny(seek, n2to5gram)
startpoint = startpoint + 1
}
#Unknown handling
if(is.na(found[1])){
pos = tagPOS(seek)
found <- searchxiny(pos, unknown)
}else{
found <- found[!is.na(found)]
}
found
}
# search for input text in n-gram database
# return "next word" options
searchxiny <- function(x, y){
y[y[,1] == paste(x, collapse = " "), 2]
}
# custom tagPOS() helper built on openNLP annotators; returns the Penn Treebank tag of the last word
tagPOS <- function(x, ...) {
s <- as.String(x)
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- Annotation(1L, "sentence", 1L, nchar(s))
a2 <- annotate(s, word_token_annotator, a2)
a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
gc() #release memory to avoid Out of Memory error
a3[[length(a3)]]$features[[1]][[1]][1]
}