The problem was to create a text prediction system using text taken from online news, blogs, and Twitter. This presentation describes the steps taken to accomplish this as well as the final algorithm.
The data was provided by SwiftKey in multiple languages; only the English data was used for this project.
The data is read in from the three source text files and then profanity is removed:
blog <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
# Remove profanity (the alternation is grouped so the space padding applies to every word)
profanity <- grepl(" +([Ff]uck|[Ss]hit|[Cc]unt|[Aa]sshole) +", twitter)
twitter <- twitter[!profanity]
profanity <- grepl(" +([Ff]uck|[Ss]hit|[Cc]unt|[Aa]sshole) +", blog)
blog <- blog[!profanity]
The data is then subset to a manageable size, and punctuation and numbers are removed.
set.seed(123455)
# Subset twitter data
rfilter <- rbinom(length(twitter), size=1, prob=0.25)
filter <- rfilter == 1
twitter <- twitter[filter]
# subset blog data
rfilter <- rbinom(length(blog), size=1, prob=0.25)
filter <- rfilter == 1
blog <- blog[filter]
# Remove numbers and punctuation
news <- cleanData(news)
twitter <- cleanData(twitter)
blog <- cleanData(blog)
Finally, the data are combined into one vector and non-ASCII characters are removed:
text <- c(news, twitter, blog)
# Remove non-ASCII characters
text <- iconv(text, "latin1", "ASCII", sub="")
A sample of the cleaned data is below:
head(text, 3)
## [1] "He wasnt home alone apparently"
## [2] "The St Louis plant had to close It would die of old age Workers had been making cars there since the onset of mass automotive production in the s"
## [3] "WSUs plans quickly became a hot topic on local online sites Though most people applauded plans for the new biomedical center many deplored the potential loss of the building"
To get an idea of the distribution of words in the data, a TermDocumentMatrix is created:
corpus <- VCorpus(VectorSource(text))
tdm <- as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf))))
frequencycount <- rowSums(tdm)
frequencycount <- sort(frequencycount, decreasing=TRUE)
This allows us to view the most frequent words. Note that I purposefully did not remove stop words as I believe that doing so would negatively affect the quality of the predictions.
load("frequencycount.RData")
head(frequencycount, 12)
## the to and a of in for that is on with said
## 18723 8634 8510 8392 7121 6511 3380 3320 2722 2701 2387 2373
We can also view a word cloud of the data:
load("corpus.RData")
pal2 <- brewer.pal(8,"Dark2")
wordcloud(corpus, scale=c(6,.2), min.freq=2, max.words=200, random.order=TRUE,
          rot.per=0.5, colors=pal2, use.r.layout=FALSE)
After a great deal of trial and error I ended up tokenizing the data into ngrams of length two through six, as shown below. This results in five models, one for each ngram length, which are combined into one large model.
# Create an empty data frame
ngrams <- data.frame(x1=character(),
                     x2=character(),
                     x3=character(),
                     x4=character(),
                     x5=character(),
                     y=character(),
                     weight=integer(),
                     stringsAsFactors=TRUE)
# Create ngrams from the text
for(i in 2:6){
    ngramfreq <- createngrams(text, i)
    model <- createmodel(ngramfreq, 6)
    ngrams <- rbind(ngrams, model)
    rm("ngramfreq","model")
}
# Reorder the columns
columns <- c("x5","x4","x3","x2","x1","y","weight")
ngrams <- ngrams[columns]
The reason for reversing the word order in this way is that I believe the words closest to the end of the input are the most important for predicting the next word. By starting at the end and working backwards through the input we can immediately discard any ngrams which do not fit the input, then narrow the results down as we continue backwards. The direction becomes important in the prediction algorithm described below, especially concerning the flexibility to skip words which do not exist in the model.
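To make the column ordering concrete, here is a small illustrative sketch (not part of the pipeline; the input string is an arbitrary example) of how a reversed input lines up with the reordered columns:
# Illustrative only: how an input maps onto the reordered columns.
# Column 1 of the model is x5 (the word immediately before the predicted
# word y), column 2 is x4, and so on.
input <- "thanks for the"                    # hypothetical input
words_rev <- rev(strsplit(input, " ")[[1]])  # "the" "for" "thanks"
# words_rev[1] is compared against x5, words_rev[2] against x4, etc., so
# words far from the end can be skipped without losing the closest context.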
Function cleanmodel takes in a list of ngrams and a threshold and keeps only the top threshold y values for each unique combination of x's. Since I return three predictions for each input string there is no need to keep more than three rows for each unique ngram. The ngrams with the highest weight are kept.
# Clean the dataframe
ngrams <- cleanmodel(ngrams, 3)
I noticed that many of the ngrams contained misspelled words, non-English words, or other combinations of characters which were not recognizable as English. To address this I downloaded a list of 450,000 valid English words and used it to filter the ngrams.
# Load valid words
words <- read.csv("words.txt")
words <- words$X2
words <- tolower(words)
# Create a list of unique words from the model
wordlist <- c(as.character(ngrams$y), as.character(ngrams$x5),
              as.character(ngrams$x4), as.character(ngrams$x3),
              as.character(ngrams$x2), as.character(ngrams$x1))
wordlist <- unique(wordlist)
# Create a list of words which are NOT valid words
filter <- apply(as.matrix(wordlist), 1, checkword)
bad_words <- wordlist[!filter]
bad_words <- unique(bad_words)
I started by removing all rows which contained invalid words in the y column, as I do not want the model to predict words which don’t exist.
y <- ngrams$y
filter <- unlist(lapply(as.character(y), function(x) x %in% bad_words))
ngrams <- ngrams[!filter,]
# Renumber the rows
rownames(ngrams) <- 1:nrow(ngrams)
I was ambivalent about cleaning the other columns, as non-existent words in those columns would not negatively affect the outcome of the algorithm, and could only benefit the predictions if a user happens to type an invalid word. In the end I did clean the other columns in order to get the data down to a reasonable size, and included logic in the prediction function for handling words which are not recognized by the model.
The code to remove invalid words from the entire ngram data frame is in Appendix I in function validateWords.
As there are far more possibilities for longer ngrams than for shorter ones, the shorter ngrams had far higher frequencies, and thus much higher weights. The most frequent bigrams occurred as many as 20,000 times in the data, while the maximum frequency for six-grams was closer to ten. Both the mean and the median frequency for ngrams of any length were at or close to 1.
To standardize the weighting I transformed it to a log10 scale and shifted the result by 0.1, so that ngrams which occur only once (log10(1) = 0) receive a weight of 0.1 rather than 0. This brought the range of weights to roughly 0.1 to 4, and allows for easier rescaling of the weights, which may still be desirable in the future, as described below.
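As a quick illustration of the transform (the frequencies here are round example values spanning the observed range, not exact counts from the data):
freqs <- c(1, 10, 20000)   # example frequencies
log10(freqs) + 0.1         # approximately 0.1, 1.1, and 4.4
The transform is applied to the full model: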
ngrammodel <- scaleweight(ngrams)
Longer ngrams should generally have a higher weight than shorter ones as they will tend to represent better predictions. I considered scaling the weights accordingly, but in the end decided that this issue could be handled in the logic of the prediction function, which is described below.
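Had I scaled the weights by ngram length, one simple approach might have been to boost each row's weight in proportion to the number of context words it contains. This is illustrative only and is not used in the final model; the function name and the 0.5 boost factor are arbitrary.
# Illustrative only: boost weights in proportion to the number of
# non-NA context words (x1-x5) present in each row.
scaleByLength <- function(ngrams){
    context_length <- rowSums(!is.na(ngrams[, c("x1","x2","x3","x4","x5")]))
    ngrams$weight <- ngrams$weight + 0.5 * context_length
    ngrams
}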
I attempted to fit some decision trees to the data, but the RAM required to do so far exceeded the RAM available on my machine. After much trial and error I decided on the model created above, as it provided a reasonable balance of power and flexibility when combined with the prediction algorithm.
The prediction algorithm takes the last five words of the input, matches them against the ngram model starting with the word closest to the end, skips any word which would reduce the candidate rows to fewer than two, and returns the three highest-weighted predictions. The prediction function follows:
# create a prediction function
predictText <- function(x){
    # Clean the input and reverse the last five words
    string <- tolower(x)
    string <- removeNumbers(string)
    string <- removePunctuation(string)
    string <- strsplit(string, " ")
    string <- rev(string[[1]])[1:5]
    i <- 1
    matches <- ngrammodel
    for(word in string){
        if(!is.na(word)){
            temp <- subset(matches, (matches[,i] == word))
            # if we have matches keep them, otherwise skip the word and continue
            if(nrow(temp) >= 2){
                matches <- temp
            }
            # if there are three or fewer unique ys we can stop looping and return
            if(length(unique(matches$y)) <= 3){
                break
            }
            i <- i + 1
        } else {
            break
        }
    }
    # Filter out duplicate ys because they don't add anything
    matches <- matches[!duplicated(matches$y, fromLast=TRUE),]
    # sort the matches by weight
    matches <- matches[order(matches$weight, row.names(matches), decreasing=TRUE),]
    # Take the top 3
    results <- head(unique(matches$y), 3)
    results <- as.character(results)
    return(results)
}
Rather than weighting longer ngrams more heavily I decided to handle this in the prediction function. My concern was that with strict subsetting the function would often return no predictions for an input which does not exist in the model. This was addressed by skipping any words which returned no results.
I also attempted to allow for the presence of an NA in any given slot, but this resulted in very low quality matches. The final compromise, as detailed above, was to ignore any word which reduced the number of potential matches to below two.
Ideally I would revisit the weighting, as a different weighting scheme, combined with different logic for handling unknown ngrams, might result in higher quality predictions.
source("predictionModel.R")
predictText("i like to")
## [1] "play" "go" "put"
A Shiny application was created to showcase this algorithm; it is available at https://ericscuccimarra.shinyapps.io/TextPrediction2/.
The application uses a dataset which filters out 75% of the text from the Twitter and blog data, in order to provide a reasonable response speed. Using more data would result in better predictions.
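For reference, a minimal sketch of what such a Shiny front end might look like is below. This is an illustration only, not the deployed application's code; it assumes predictionModel.R defines predictText() and loads the saved ngram model.
library(shiny)
source("predictionModel.R")   # assumed to define predictText() and load the ngram model

ui <- fluidPage(
    titlePanel("Text Prediction"),
    textInput("phrase", "Enter some text:"),
    verbatimTextOutput("predictions")
)

server <- function(input, output){
    output$predictions <- renderPrint({
        if(nchar(input$phrase) > 0) predictText(input$phrase)
    })
}

shinyApp(ui = ui, server = server)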
The source code is available on my GitHub at https://github.com/escuccim/DataScienceCapstone. Note that this repository contains all of the R scripts I created for this project, some of which are not used.
To create a prediction model, follow the steps described above: read in and clean the text, build the combined ngram model, filter it with cleanmodel and validateWords, and scale the weights with scaleweight.
Be aware that running this entire process as-is will take a long time and require a lot of CPU and RAM.
The file biggerData.R includes all of the steps to create an ngrammodel, using a larger subset of the data than I originally used.
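Putting the pieces together, a sketch of that pipeline using the functions listed in the appendix might look like the following. The subsetting step, the words.txt file, and the output file name ngrammodel.RData are assumptions; see biggerData.R in the repository for the version actually used.
library(tm)          # removeNumbers, removePunctuation
library(tokenizers)  # tokenize_sentences, tokenize_ngrams

# Read and clean a subset of the raw text
text <- c(readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8"),
          readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8"),
          readLines("final/en_US/en_US.news.txt", encoding="UTF-8"))
set.seed(123455)
text <- text[rbinom(length(text), size=1, prob=0.25) == 1]
text <- cleanData(iconv(text, "latin1", "ASCII", sub=""))

# Build the combined ngram model and trim it to the top three ys per context
ngrams <- makedata(text)
ngrams <- cleanmodel(ngrams, 3)

# Remove rows containing words not in the valid-word list
words <- valid_word_list("words.txt")
bad_words <- invalid_word_list(uniquewords(ngrams))
ngrams <- validateWords(ngrams, bad_words)

# Scale the weights and save the finished model
ngrammodel <- scaleweight(ngrams)
save(ngrammodel, file="ngrammodel.RData")
The helper functions referenced throughout this report, including validateWords from Appendix I, follow.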
# Split the strings into features
splitngrams <- function(x) {
    strsplit(x, " ")
}
# Turn a character vector of text into a data frame of ngrams with their associated frequency
createngrams <- function(sentences, n, threshold=0){
    ngrams <- tokenize_ngrams(sentences, lowercase=TRUE, n=n, n_min=n, simplify=TRUE)
    # Unlist it
    ngrams <- unlist(ngrams)
    # Get frequency counts
    ngramfreq <- as.data.frame(table(ngrams))
    rm("ngrams")
    # Order the ngrams by descending frequency
    ngramfreq <- ngramfreq[order(-ngramfreq$Freq),]
    if(threshold > 0){
        # Filter out ngrams which occur no more than threshold times
        filter <- ngramfreq$Freq > threshold
        ngramfreq <- ngramfreq[filter,]
        rm("filter")
    }
    # Return ngram frequency
    return(ngramfreq)
}
createmodel <- function(ngramfreq, cols=6){
    ngramfreq$ngrams <- as.character(ngramfreq$ngrams)
    splits <- splitngrams(ngramfreq$ngrams)
    splits <- do.call(rbind, splits)
    splits <- as.data.frame(splits)
    # Name the columns: the last word is y, the preceding words are x5, x4, ...
    numcols <- ncol(splits)
    nums <- (cols + 1 - numcols):5
    names(splits) <- c(paste("x", nums, sep=""), "y")
    splits$weight <- ngramfreq$Freq
    ngrammodel <- splits
    # Order the data frame by weight within x5
    ngrammodel <- ngrammodel[with(ngrammodel, order(x5, -weight)),]
    # Reindex it
    rownames(ngrammodel) <- 1:nrow(ngrammodel)
    # Add columns with NAs where appropriate
    if(numcols < cols){
        newcols <- 1:(cols - numcols)
        newcolnames <- paste("x", newcols, sep="")
        ngrammodel[,newcolnames] <- NA
    }
    return(ngrammodel)
}
# Since we only return the top three matches we only need to keep the top `threshold` ys for each combination of x's
cleanmodel <- function(model, threshold=3) {
    numx <- ncol(model) - 2
    cols <- paste("x", 1:numx, sep="")
    # Get the starting row of each unique combination of x columns
    # (assumes rows for each combination are contiguous and sorted by descending weight)
    uniqueRows <- unique(model[cols])
    filter <- rep(FALSE, nrow(model))
    for(i in as.numeric(rownames(uniqueRows))){
        # Keep at most `threshold` rows for this combination, without running past the end
        filter[i:min(i + threshold - 1, nrow(model))] <- TRUE
    }
    model <- model[filter,]
    # Renumber the rows
    rownames(model) <- 1:nrow(model)
    return(model)
}
cleanData <- function(data){
    data <- removeNumbers(data)
    data <- removePunctuation(data)
    data
}
pad <- function(x, n) {
    len.diff <- n - length(x)
    c(x, rep(NA, len.diff))
}
checkword <- function(word){
    return(word %in% words)
}
scaleweight <- function(ngrams){
    # Log-transform the frequencies and shift by 0.1 so the minimum weight is 0.1
    ngrams$weight <- log10(ngrams$weight) + 0.1
    return(ngrams)
}
# remove non-words
checkwords <- function(list){
    if(!exists("words")){
        # checkword() looks up `words` outside this function, so assign it globally
        words <<- valid_word_list("words.txt")
    }
    filter <- apply(as.matrix(list), 1, checkword)
    return(filter)
}
# make a model for a maximum of six-grams from a given text corpus.
# threshold specifies a minimum required number of occurrences in the text of an ngram
makedata <- function(text, threshold=0) {
    sentences <- unlist(tokenize_sentences(text))
    rm(text)
    # Create an empty data frame
    ngrams <- data.frame(x1=character(),
                         x2=character(),
                         x3=character(),
                         x4=character(),
                         x5=character(),
                         y=character(),
                         weight=integer(),
                         stringsAsFactors=TRUE)
    # Create ngrams of length two through six from the text
    for(i in 2:6){
        ngramfreq <- createngrams(sentences, i, threshold)
        model <- createmodel(ngramfreq, 6)
        ngrams <- rbind(ngrams, model)
        rm("ngramfreq","model")
    }
    # Reorder the columns
    columns <- c("x5","x4","x3","x2","x1","y","weight")
    ngrams <- ngrams[columns]
    return(ngrams)
}
# For an ngrammodel and a list of INVALID words, remove any rows which contain invalid words
validateWords <- function(ngrammodel, bad_words) {
    # Drop rows containing an invalid word, one column at a time
    for(col in c("y", "x5", "x4", "x3", "x2", "x1")){
        filter <- as.character(ngrammodel[[col]]) %in% bad_words
        ngrammodel <- ngrammodel[!filter,]
    }
    rownames(ngrammodel) <- 1:nrow(ngrammodel)
    return(ngrammodel)
}
# Get list of unique words
uniquewords <- function(ngrammodel){
    wordlist <- c(as.character(ngrammodel$y), as.character(ngrammodel$x5),
                  as.character(ngrammodel$x4), as.character(ngrammodel$x3),
                  as.character(ngrammodel$x2), as.character(ngrammodel$x1))
    wordlist <- unique(wordlist)
    return(wordlist)
}
# Get list of which words are NOT valid
invalid_word_list <- function(wordlist){
    filter <- apply(as.matrix(wordlist), 1, checkword)
    bad_words <- wordlist[!filter]
    bad_words <- unique(bad_words)
    return(bad_words)
}
# create a vector of valid words from a text list
valid_word_list <- function(file) {
    words <- read.csv(file)
    words <- tolower(words$X2)
    save(words, file="validwords.RData")
    return(words)
}