Using over 4 million text entries combined from Twitter, blog posts, and news articles, we will create a predictive text engine in a Shiny app.
The steps will be as follows:
NOTE: Due to ShinyApps.io memory limits, the final Shiny app uses a version of thebigcorpus derived from the following code:
set.seed(4)
x <- sample(thebigcorpus, 100000)
dput(x, file = "thebigcorpus")
Let’s Begin.
This may take 1 to 3 minutes, depending on your system.
We download, unzip, extract the ENGLISH language files, and combine the Twitter, blog, and news postings into one corpus. I am going to diverge from others here and NOT format it into a data.table and NOT do any sampling or arbitrary weighting.
Using the data.table approach is useful for the prediction engine, but much slower for the indexing of which entries are relevant to us.
Sampling is great, but I want to be able to return predictions even for rarer English phrases. And of course, the “Unreasonable Effectiveness of Data” applies.
I feel there is a strong argument for weighting blog posts or news articles higher than tweets, as Twitter's 140-character limit incentivizes users to use novel, shorthand English (idk, lol, nbd, idc, wyd... the list goes on). I am going to rely on the algorithm to ignore those unless they are prevalent. Weighting will be done using a Katz backoff approach, described later.
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(zipURL, destfile = "./swiftzip.zip") #548 MB
unzip("./swiftzip.zip", list = TRUE) #get the file names don't extract
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
filepth <- getwd() # get the working directory
unzip("./swiftzip.zip", exdir = filepth) #unzip the files into the working directory
#inside the directory is now a new directory called "final"
#NOTE: if the data are already downloaded, resume running from this line -----
#inside "final" is 4 folders, 1 for each language
setdirectory <- "./final/en_US" #specify the directory with the ENGLISH US files
setwd(setdirectory)
#twitter, news, and blog txt files
twittertxt <- readLines("en_US.twitter.txt", skipNul = TRUE) #twitter data
blogtxt <- readLines("en_US.blogs.txt", skipNul = TRUE) #blog data
newscon <- file("en_US.news.txt",'rb') #news data
newstxt <- readLines(newscon)
close(newscon)
#News was downloaded in a binary read due to special chars, nulls, and SUBs in the data
rm(newscon) #clean environment
thebigcorpus <- c(twittertxt,blogtxt,newstxt) #combine it all
rm(twittertxt,blogtxt,newstxt) #free memory; the originals now live in thebigcorpus
rm(filepth,setdirectory,zipURL) #clean environment
Our example phrase to help us think about the process: “I want to go to the beach because I like swimming.”
This function removes common English words such as: I, a, the, of, did, not... The full list of words is available from the tm package via stopwords(kind = "en").
Applied to our example phrase, this function would output: "want go beach like swimming". This is great for connecting "go" to the word "beach", or "beach" to the word "swimming", in as few words as possible; it also shrinks the data while losing minimal meaning.
NOTE: This function works great for single strings, but for large character vectors (1000+ elements) it is too slow. We need to create a vectorized version - i.e. remove.vector.stopwords()
countwords() - counts the number of words in a phrase
removenwords() - removes a desired number (n) of words from a phrase, starting at the first one.
getlastnwords() - retrieves a desired number (n) of words from a phrase, starting from the last one. This function is used to get our prediction word. For example, "want go beach like _________" could hypothetically return "sun", "swim", "swimming", "drink", "tan"... we're going to trust the quantity of data to best predict words.
makeInternational() - replaces foreign letters such as “ê” with “e”. Can handle large elements, does NOT need to be looped/applied.
thebigclean() - cleans the input by forcing lowercase, changing foreign characters, and removing punctuation, numbers, stopwords, and extra spaces. Each of these arguments is optional, but recommended. NOTE: It is critical that the operations run in the order written above: lowercase, then international, then punctuation, then numbers, then stopwords, then extra spaces last.
katzbackoff.grep() - The powerhouse function. It takes an input, then indexes the corpus to find which entries contain the input. It then repeats the search after removing leading words to increase flexibility (by default it backs off until only the last two words remain; this can be changed). Entries that contain more of the input words are weighted by duplication. The duplicates can be erased if desired, but that is mathematically equivalent to searching only the minimum number of words (by default the last two) and returning the most common next word.
library(tm)
## Loading required package: NLP
removestopwords <- function(inputtext, language = "en", lowercase = TRUE){
#This version is good for single strings (like the user's input text) and lowercases by default
#It leaves punctuation intact; I won't be using that, but it may be useful to you.
stops <- stopwords(kind = language) #from the tm package
inputtext <- strsplit(inputtext, split = " ") #list of length 1
if(lowercase == TRUE) #the stopwords are lowercase, so be careful
inputtext <- tolower(inputtext[[1]]) #character class and lowercase
else inputtext <- inputtext[[1]]
inputtext <- subset(inputtext,
!(inputtext %in% stops)) #subset, NOT stopwords
return(paste0(inputtext, collapse = " ")) #back in string form
}
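As an illustration (not part of the original pipeline), applying removestopwords() to the example phrase should behave roughly as follows:
removestopwords("I want to go to the beach because I like swimming")
#expected output: approximately "want go beach like swimming"
#(lowercased, with tm's English stopwords such as "I", "to", "the", "because" dropped)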
remove.vector.stopwords <- function(inputcorp) {
library(tm)
#This is the fast version for removing stopwords from large character classes
#make sure to remove punctuation, numbers, symbols, foreign letters, FIRST
badwords <- stopwords(kind="en") #get the 174 stopwords
badwords <- gsub("\'","",badwords) #remove apostrophes
#again, punctuation should already be removed
inputcorp <- removeWords(inputcorp,badwords)
return(inputcorp)
#Remove extra whitespace LAST: see thebigclean()
}
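A quick sketch of calling it on a whole character vector; samplecorp is just a toy vector I made up, with punctuation already stripped:
samplecorp <- c("i want to go to the beach", "because i like swimming") #toy vector, lowercase, punctuation-free
remove.vector.stopwords(samplecorp)
#roughly c(" want  go   beach", "  like swimming") - note the leftover whitespace,
#which is why thebigclean() collapses extra spaces as its final step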
countwords <- function(inputtext){
inputtext <- strsplit(inputtext, split = " ") #list
inputtext <- inputtext[[1]] #character class
baselength <- length(inputtext)
return(baselength)
}
removenwords <- function(inputtext, num.words.to.remove = 1){
inputtext <- strsplit(inputtext, split = " ") #list
inputtext <- inputtext[[1]] #character class
baselength <- length(inputtext)
if(num.words.to.remove >= baselength)
return("you erased everything!")
inputtext <- inputtext[(num.words.to.remove+1):baselength] #starts at n+1
return(paste0(inputtext, collapse = " ")) #back in string form
}
getlastnwords <- function(inputtext, num.words.to.get = 1){
inputtext <- strsplit(inputtext, split = " ") #list
inputtext <- inputtext[[1]] #character class
baselength <- length(inputtext)
if(num.words.to.get > baselength)
return("stop, you're getting everything!")
inputtext <- inputtext[(baselength + 1 - num.words.to.get):baselength]
return(paste0(inputtext, collapse = " "))
}
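To make the three helpers concrete, here is roughly how they behave on the cleaned example phrase (illustrative calls only):
countwords("want go beach like swimming")        #5
removenwords("want go beach like swimming", 2)   #"beach like swimming"
getlastnwords("want go beach like swimming", 2)  #"like swimming"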
makeInternational <- function(inputtext){
inputtext <- iconv(inputtext, to="ASCII//TRANSLIT")
#makes foreign letters into standard letters, some symbols become punctuation
#NOTE this does NOT need to be looped over large character classes, just input all of it
return(inputtext)
}
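For example (exact transliteration depends on your platform's iconv implementation):
makeInternational("crème brûlée") #accents converted to plain ASCII, roughly "creme brulee"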
thebigclean <- function(inputtext, removestops = TRUE,lowercase = TRUE,
punctuation = TRUE, numbers = TRUE, international =TRUE){
library(tm)
if(lowercase == TRUE)
inputtext <- tolower(inputtext) #works on large characters
if(international == TRUE)
inputtext <- makeInternational(inputtext) #works on large characters
if(punctuation == TRUE)
inputtext <- gsub("[[:punct:]]", "", inputtext) #removes punctuation
#works on large characters
if(numbers == TRUE)
inputtext <- gsub("[0-9]","",inputtext) #removes numbers
#works on large characters
if(removestops == TRUE)
inputtext <- remove.vector.stopwords(inputtext)
#removes stopwords (punctuation adjusted) using tm package, faster than a loop
inputtext <- gsub("\\s+"," ",inputtext) #removes extra spaces from above, DO THIS LAST!
#works on large characters
return(inputtext)
}
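Applying the full cleaner to the example sentence should give roughly:
thebigclean("I want to go to the beach, because I like swimming!")
#roughly "want go beach like swimming"
#(a stray leading space can remain when the sentence starts with a stopword)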
katzbackoff.grep <- function(inputtext,thecorpus,rmvstops = TRUE,minwords = 2,duplicates = TRUE){
workingtext <- thebigclean(inputtext, removestops = rmvstops) #cleans the input text
thecorpusindex = NULL #creates an empty index vector
#Note for speed, we want to immediately trim thecorpus by the minimum words.
#i.e. there is no point in searching the entire corpus for "go beach like" if
#our minimum is that every entry has "beach like"
#this backward search allows for faster weighting of "go beach like", "want go beach like"... etc
minimum.criteria <- getlastnwords(workingtext, num.words.to.get = minwords)
thecorpus <- grep(minimum.criteria, thecorpus, value = TRUE) #value = TRUE returns the matching entries themselves (not indices); only this relevant subset is searched for weighting
repeat{ #repeat until the minimum words allowed are reached
newindex <- grep(workingtext,thecorpus) #search for the text in the corpus
thecorpusindex <- c(thecorpusindex, newindex) #add it to the index
workingtext <- removenwords(workingtext, 1) #take off the first word and repeat
if(countwords(workingtext) < minwords){ #finish after minwords has been run.
break
}
if(workingtext == "you erased everything!"){ #this is what removenwords() returns when the input has too few non-stopwords
break
}
}
if(duplicates == FALSE){ #it is better to allow duplicates for weighting
thecorpusindex <- unique(thecorpusindex) #equivalent to searching last non-stop word
}
return(thecorpusindex) #returns an index of relevant entries, duplicated as default
}
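A minimal usage sketch; idx is just an illustrative name, and it assumes thebigcorpus is in memory and cleaned as described below:
idx <- katzbackoff.grep("I want to go to the beach because I like", thebigcorpus)
#idx contains (possibly duplicated) positions into the subset of thebigcorpus that matches "beach like";
#entries matching more of the input appear more than once, which is the weighting
head(idx)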
I want the full corpus cleaned the same way the input text will be cleaned: lowercased, foreign characters converted, punctuation and numbers removed, stopwords removed, and extra whitespace collapsed.
A 100,000-element sample of thebigcorpus took 1.4 seconds to makeInternational().
A 100,000-element sample of thebigcorpus took 3.4 seconds to run thebigclean() with stopwords not removed.
A 100,000-element sample of thebigcorpus took 5 seconds to remove stopwords.
A 100,000-element sample of thebigcorpus took 10 seconds to run the entire thebigclean() process, ordered appropriately.
For stability, we'll apply each step to thebigcorpus individually rather than calling thebigclean() on the whole thing at once.
thebigcorpus <- tolower(thebigcorpus) #makes lowercase; 37 seconds
thebigcorpus <- makeInternational(thebigcorpus) #changes foreign characters; 60 seconds
thebigcorpus <- gsub("[[:punct:]]", "",thebigcorpus) #erases punctuation; 42 seconds
thebigcorpus <- gsub("[0-9]","",thebigcorpus) #erases numbers; 21 seconds
thebigcorpus <- remove.vector.stopwords(thebigcorpus) #removes stopwords; 209 seconds
thebigcorpus <- gsub("\\s+"," ",thebigcorpus) #replaces all series (1+) of spaces with a single space; 63 seconds
For the Shiny app, the output of this cleaning process (which is smaller in memory) will be shipped with the app and loaded at startup for speed.
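As a minimal sketch, the app could restore the dput()-saved sample from the note at the top with dget(); the file name is the one used there:
thebigcorpus <- dget("thebigcorpus") #reload the saved corpus object when the app starts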
With a weighted index of relevant entries from katzbackoff.grep(), we subset the corpus (entries weighted by duplication) and extract all the n-grams that begin with the last n words of our input text.
For example, "I want to go to the beach because I like +predictedword" becomes "want go beach like +predictedword"; the bigrams of interest are "like +predictedword" and the trigrams are "beach like +predictedword".
Note that katzbackoff.grep() defaults to a minimum of 2 words, which means trigram searches: "beach like ..." is the minimum that every relevant entry contains.
We then extract all the predictions (+predictedword), compute their proportions/probabilities, and return the top results to the app user.
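To make the n-gram step concrete, here is a tiny illustration of what unnest_tokens() does with a single cleaned entry; toy is made-up data, not the real corpus:
library(tidytext)
toy <- data.frame(txt = "want go beach like swimming", stringsAsFactors = FALSE)
unnest_tokens(toy, ngram, txt, token = "ngrams", n = 3)
#trigrams: "want go beach", "go beach like", "beach like swimming"
#keeping only those that start with "beach like" leaves "beach like swimming",
#whose last word ("swimming") is the candidate prediction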
#NOTE: this model runs all previous functions with their default values
#It is strongly recommended to leave the minimum number of words at 2 or more for speed.
library(tidytext) #for n-grams
library(dplyr) #masks filter for subsetting
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
textpred <- function(inputtext, thecorpus = thebigcorpus, minimum.n = 2, number.of.predictions = 3){
if(countwords(inputtext) > 8)
print("Please be patient with longer inputs")
workingtext <- thebigclean(inputtext) #cleans everything as the corpus was cleaned
theindex <- katzbackoff.grep(workingtext,thecorpus, minwords = minimum.n) #duplicates default
#"I want to go to the beach because I like" takes 6.04 seconds to repeat search the entire corpus
#NOTE This returns an index based on a corpus meeting the minimum criteria (see the function)
if(length(theindex) == 0){
return("Input too short or too rare")
}
#This code is taken from the katzbackoff.grep code
# "theindex" above matches the corpus "initialcorpus" created here:
minimum.criteria <- getlastnwords(workingtext, num.words.to.get = minimum.n)
initialcorpus <- grep(minimum.criteria, thecorpus, value = TRUE)
#initial corpus is the elements of the original corpus that meet the minimum criteria
therelevantcorpus <- initialcorpus[theindex] #subset by the (duplicated) index
#NOTE: The duplication is key- although we search using the minimum number of words
# Entries with more keywords are weighted by being duplicated
# every entry meets the minimum word criteria, per the Katz backoff method
therelevantcorpus <- data.frame(therelevantcorpus,
stringsAsFactors = FALSE) #for tokening
#makes a data frame of only the relevant entries of the corpus
ngram.table <- unnest_tokens(therelevantcorpus, ngram,
therelevantcorpus,
token = "ngrams",
n = minimum.n + 1)
#uses the minimum number of words to make n-grams from the relevant corpus
# minimum words from the input + a prediction word
keywords <- getlastnwords(workingtext, minimum.n) #gets the ending of the input
keywords <- paste0("^",keywords, collapse = "") #makes it searchable in grepl
ngram.table <- filter(ngram.table, grepl(keywords,ngram)) #finds it
#NOTE grepl returns TRUE/FALSE as opposed to grep which returns indexes or values of true
#to predict a new word, we keep only the n-grams that begin with the last minimum.n words of the workingtext
#NOTE: if the last word in your search has alternative forms
#such as {like, likely, liked}, they're included - A way around this is to STEM words
#a process not used here
#now we simply take the last word of each n-gram - i.e. our predicted word -
#and return a sorted table of those words
predtable <- sort(table(sapply(ngram.table$ngram,getlastnwords, USE.NAMES = FALSE)), decreasing = TRUE)
#cuts the names for aesthetic, only shows the predicted word
predtable <- round(predtable / sum(predtable),2) #gets proportion
if(length(predtable) < number.of.predictions){
number.of.predictions <- length(predtable)}
#prevents NAs from being shown if the requested number of predictions is too high
return(predtable[1:number.of.predictions])
}
A few test cases I thought would be interesting:
“I can’t wait to meet someone ____________”
“I don’t know what I want to eat today, maybe I’ll eat ___________”
“After I left the airport I took a ____________”
“I’ve been feeling sick all week, maybe I should go to the _____________”
The default will be to return the top 3 results for each test, using a minimum of the last two (non-stop) words.
test1 <- "I can't wait to meet someone"
test2 <- "I don't know what I want to eat today, maybe I'll eat"
test3 <- "After I left the airport I took a"
test4 <- "I've been feeling sick all week, maybe I should go to the"
textpred(test1)
##
## new else like
## 0.09 0.04 0.03
textpred(test2)
## [1] "Please be patient with longer inputs"
##
## lettuce makeup much
## 0.06 0.06 0.06
textpred(test3)
##
## diverted full house
## 0.17 0.17 0.17
textpred(test4)
## [1] "Please be patient with longer inputs"
##
## thing back idea
## 0.04 0.03 0.02
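If you want more candidates or a stricter match, the defaults can be overridden; this is an illustrative call only, and its output will depend on the corpus in memory:
textpred(test3, minimum.n = 3, number.of.predictions = 5) #match on the last 3 non-stop words, return up to 5 predictions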