The goal of this project is to predict the next word based on user input, similar to the SwiftKey keyboard used on iOS and Android mobile devices. Natural Language Processing (NLP) is used to clean the data and build n-gram models, and a Stupid Back-off algorithm is then used to look for the most likely next words.
Based on an n-gram model (using unigrams, bigrams, trigrams and quadgrams), the Stupid Back-off method determines the next likely word by consulting the quadgram, trigram, bigram and unigram tables in that order. Specific weights are assigned to the different n-grams, so that words determined by a higher-order n-gram (say the quadgram) have a higher likelihood of appearing as the outcome than those from lower-order n-grams such as the trigram or bigram. If no next word can be determined from the higher-order n-grams, the unigram with the highest frequency is returned.
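As a rough sketch of the scoring idea (the helper below is illustrative only; the actual weights and implementation appear in the predictnextword function later in this report):
# Illustrative sketch of the Stupid Back-off scoring; backoff_score() is a
# hypothetical helper, but the weights mirror the ones used later (1, 0.6, 0.4, 0.16)
backoff_score <- function(freq, total_freq, weight) {
  # relative frequency of the candidate within its n-gram table,
  # scaled by the weight of that n-gram level
  freq / total_freq * weight
}
# A candidate found in the quadgram table (weight 1) outranks one found
# only in the bigram table (weight 0.4) at the same relative frequency
backoff_score(freq = 10, total_freq = 50, weight = 1)    # 0.20
backoff_score(freq = 10, total_freq = 50, weight = 0.4)  # 0.08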
This project takes samples of historical user text from blog postings, news articles and Twitter feeds and applies natural language processing so that a prediction of the next word can be provided given a user input of one or more words. Using the blog, news and Twitter datasets provided by HC Corpora (www.corpora.heliohost.org), a smaller sample is taken to create a model using quadgrams (4 words), trigrams (3 words), bigrams (2 words) and unigrams (1 word) along with their respective frequencies of occurrence.
The n-gram models are developed after cleansing the data as follows: 1. converting the text to lower case 2. removing punctuation 3. removing numbers 4. removing English stop words 5. removing profane words 6. removing extra whitespace
The prediction works as follows: if more than three words are entered, the last three words are taken for prediction; if three or fewer words are entered, all of them are retained for determining the next word. From that input, the last word is used with the bigram table to determine the next word; similarly, the last two words are used with the trigram table and the last three words with the quadgram table. If no input is entered, or the input cannot be matched to any of the n-grams, the unigrams with the highest frequencies are returned.
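The sketch below illustrates how the last three words of an input might be extracted (a hypothetical helper, not the exact project code; the report's textcleanse function handles the real cleansing and reversal):
# Illustrative only: keep at most the last three words of the input
# before querying the n-gram tables
lastwords <- function(input, n = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  tail(words, n)
}
lastwords("thanks for the follow")  # "for" "the" "follow"
lastwords("thank you")              # "thank" "you"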
This is a simple model that is only as good as its training data. Hence it is really based on the text taken and is contextualized to blogs, news and Twitter. If a test case is text taken from, say, a William Shakespeare play, this predictor may not work well.
This predictor is also limited by the model, which is optimised using the most frequently occurring two-, three- and four-word combinations. Hence, for a five- or six-word phrase it may not measure up to expectation, since only n-grams up to quadgrams were used. However, with some effort, 5-word and 6-word models can be built to extend this predictor.
The data is downloaded from the SwiftKey URL (http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) automatically if the “data” subdirectory is not present in the working directory. Otherwise, the script assumes that the data is already downloaded and the files are present in the directory.
If the data directory is not present in the working directory, the data is downloaded and unzipped by the script; otherwise this step is skipped. This ensures that the data is not downloaded repeatedly, saving time and bandwidth.
# To avoid downloading the huge dataset again and again, it is downloaded and
# unzipped only when the data directory is missing.
# Get current working dir
maindir <- getwd()
subdir <- "data"
# Download file and unzip only when the data directory is not present
chkdir <- dir.exists(file.path(maindir, subdir))
if (chkdir!=TRUE) {
# Create data directory and set that as the data directory
dir.create(file.path(maindir, subdir), showWarnings = FALSE)
setwd(file.path(maindir, subdir))
# specify the source and destination of the download
dest.file <- "Coursera-SwiftKey.zip"
src.file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
download.file(src.file, dest.file)
# extract the files from the zip file
unzip(dest.file)
# Return to the main working directory so later relative paths still work
setwd(maindir)
}
The key component of Natural Language Processing (NLP) used to predict the next word is the n-gram. According to Wikipedia, in the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
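For a concrete example, the bigrams of a short sentence can be listed with a couple of lines of base R (an illustrative hand-rolled sketch; the project itself uses RWeka::NGramTokenizer):
# Illustrative example of word n-grams from a short sentence
words <- c("to", "be", "or", "not", "to", "be")
# bigrams: all consecutive pairs of words
paste(head(words, -1), tail(words, -1))
# "to be" "be or" "or not" "not to" "to be"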
There are various conditions built into the script to save time. If the n-grams are already saved in the data directory, they are loaded directly for processing; otherwise, the n-grams are created. There are a few steps in the creation of the n-grams. First, the data is sampled from the large collection of corpora: Twitter, news and blog. Then the data is further split into smaller training and test datasets for training and testing the model. The raw data (Twitter, news and blog) are saved in the .RData format for faster loading and processing. Similarly, the corpus is also saved in .RData format.
Finally, after the n-grams are generated from the training data, they are saved in a .RData file for faster loading. Eventually, when the function is run, it looks for the saved .RData files. It loads them if found; if not, it creates and saves the 3 RData files, namely rawdata.RData, corpus.RData and ngrams.RData. The .RData files need to be deleted if the intention is to regenerate them, say after parameters such as the seed or sample size are changed.
#====================================
# n-gram extraction function
#====================================
ngramextract <- function(){
#-----------------------------
# Initial setup
#-----------------------------
# Load libraries
suppressWarnings(suppressMessages(library(stringi)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(slam)))
suppressWarnings(suppressMessages(library(RWeka)))
# Set working directory
setwd("~/RProject/DataScienceCoursera/10_Capstone Project/coursera-datascience-capstone")
# Number of records in the smaller dataset
samplerecords <- 20000
# Number of records in training dataset
nlinestr <- samplerecords * 0.9
# Number of records in testing dataset
nlinests <- samplerecords * 0.1
#-----------------------------
# NGRAMS CREATION
#-----------------------------
# Load data. If that does not exist then continue with the following steps
# Load the ngrams data if they exist
checkn <- file.exists(file.path("data/ngrams.RData"))
if (checkn==F){
#-----------------------------
# RAW DATA
#-----------------------------
# Load the rawdata if exists, else create it
checkr <- file.exists(file.path('data/rawdata.RData'))
if (checkr==F){
# read in the data
twit <- readLines('data/final/en_US/en_US.twitter.txt', encoding = 'UTF-8')
news <- readLines('data/final/en_US/en_US.news.txt' , encoding = 'UTF-8')
blog <- readLines('data/final/en_US/en_US.blogs.txt' , encoding = 'UTF-8')
# Save rawdata
save(twit, news, blog, file='data/rawdata.RData')
} else {load('data/rawdata.RData')}
#-----------------------------
# CORPUS DATA
#-----------------------------
# Load the corpus data if exists, else create it
checkc <- file.exists(file.path('data/corpus.RData'))
if (checkc==F){
# Set Seed for all sampling
set.seed(1000)
# Smaller dataset
twit <- twit[sample(1:length(twit), samplerecords)]
news <- news[sample(1:length(news), samplerecords)]
blog <- blog[sample(1:length(blog), samplerecords)]
# Sampling of twitter into training and testing
index <- sample(1:length(twit), nlinestr)
sampletwittr <- twit[index] # Train
sampletwitts <- twit[-index] # Test
# Sampling of news into training and testing
index <- sample(1:length(news), nlinestr)
samplenewstr <- news[index]
samplenewsts <- news[-index]
# Sampling of blog into training and testing
index <- sample(1:length(blog), nlinestr)
sampleblogtr <- blog[index]
sampleblogts <- blog[-index]
# Subsetting (shuffling) the test datasets; each test set has nlinests records
indexts <- sample(1:nlinests, nlinests)
sampletwitts <- sampletwitts[indexts]
samplenewsts <- samplenewsts[indexts]
sampleblogts <- sampleblogts[indexts]
#-----------------------------
# clean the data
#-----------------------------
## clean twitter data
asciitwittr <- stri_enc_toascii(sampletwittr)
asciitwittr <- stri_replace_all_regex(asciitwittr,'\032','')
asciitwitts <- stri_enc_toascii(sampletwitts)
asciitwitts <- stri_replace_all_regex(asciitwitts,'\032','')
## clean news data
asciinewstr <- stri_enc_toascii(samplenewstr)
asciinewstr <- stri_replace_all_regex(asciinewstr,'\032','')
asciinewsts <- stri_enc_toascii(samplenewsts)
asciinewsts <- stri_replace_all_regex(asciinewsts,'\032','')
## clean blogdata
asciiblogtr <- stri_enc_toascii(sampleblogtr)
asciiblogtr <- stri_replace_all_regex(asciiblogtr,'\032','')
asciiblogts <- stri_enc_toascii(sampleblogts)
asciiblogts <- stri_replace_all_regex(asciiblogts,'\032','')
# Train and Test data
train <- VectorSource(c(asciitwittr, asciinewstr, asciiblogtr))
test <- VectorSource(c(asciitwitts, asciinewsts, asciiblogts))
# Assign to corpus of training and testing
corpustr <- Corpus(train)
corpusts <- Corpus(test)
# Clean memory
rm(blog, news, twit, sampleblogtr, samplenewstr)
rm(sampletwittr, sampleblogts, samplenewsts, sampletwitts)
# Save the corpus in .RData format
save(corpustr, corpusts, file='data/corpus.RData')
} else {load('data/corpus.RData')}
# Load profane words
profanewords <- readLines("data/profanityfilter.txt", encoding = "UTF-8")
# Cleanse the data
corpustr <- tm_map(corpustr, content_transformer(tolower))
corpustr <- tm_map(corpustr, removePunctuation)
corpustr <- tm_map(corpustr, removeNumbers)
corpustr <- tm_map(corpustr, removeWords, stopwords("english"))
corpustr <- tm_map(corpustr, removeWords, profanewords)
corpustr <- tm_map(corpustr, stripWhitespace)
#-----------------------------
# Generate ngram frequencies
#-----------------------------
# sentence delimiters; prevent clustering across sentence boundaries
# sentence delimiters; prevent n-grams from spanning sentence boundaries
delimiters <- " \\t\\r\\n.!?,;\"()"
# Tokenizers (the delimiters are passed to the Weka tokenizer)
library(RWeka)
unigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = delimiters))
bigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = delimiters))
trigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = delimiters))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = delimiters))
# Controls for the ngrams
uni.control <- list(tokenize = unigramTokenizer , bounds = list(global = c(6,Inf)))
bi.control <- list(tokenize = bigramTokenizer , bounds = list(global = c(3,Inf)))
tri.control <- list(tokenize = trigramTokenizer , bounds = list(global = c(2,Inf)))
quad.control <- list(tokenize = quadgramTokenizer, bounds = list(global = c(2,Inf)))
# Since the NGramTokenizer hangs on the parallel::mclapply call,
# changing the number of threads sorts the issue
# Sets the default number of threads to use
options(mc.cores=1)
# Generate the Term document matrix
library(tm)
unigram.tdm <- TermDocumentMatrix(corpustr, control = uni.control)
bigram.tdm <- TermDocumentMatrix(corpustr, control = bi.control)
trigram.tdm <- TermDocumentMatrix(corpustr, control = tri.control)
quadgram.tdm <- TermDocumentMatrix(corpustr, control = quad.control)
## Generate frequencies of ngrams
library(slam)
freq.uni <- rowapply_simple_triplet_matrix(unigram.tdm, sum)
freq.bi <- rowapply_simple_triplet_matrix(bigram.tdm, sum)
freq.tri <- rowapply_simple_triplet_matrix(trigram.tdm, sum)
freq.quad <- rowapply_simple_triplet_matrix(quadgram.tdm, sum)
# Save the n-gram frequencies to an RData file
save(freq.uni, freq.bi, freq.tri, freq.quad, file = 'data/ngramsfreq.RData')
#-----------------------------
# Generate the ngram dataframes
#-----------------------------
# Split the strings into individual words
firstname.bi <- sapply(strsplit(names(freq.bi), " "), function(a) a[1])
secname.bi <- sapply(strsplit(names(freq.bi), " "), function(a) a[2])
firstname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[1])
secname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[2])
thirname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[3])
firstname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[1])
secname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[2])
thirname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[3])
fourname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[4])
# get the final n-gram dataframe (Data table)
library(data.table)
## Unigram data table
unigramDT <- data.table(unigram = names(freq.uni),
freq = freq.uni)
## Bigram data table
bigramDT <- data.table(bigram = names(freq.bi),
freq = freq.bi,
unigram = firstname.bi,
nextword = secname.bi)
## Trigram data table
trigramDT <- data.table(trigram = names(freq.tri),
freq = freq.tri,
unigram = paste(firstname.tri, secname.tri),
nextword = thirname.tri)
## Quadgram data table
quadgramDT <- data.table(quadgram = names(freq.quad),
freq = freq.quad,
trigram = paste(firstname.quad, secname.quad, thirname.quad),
nextword = fourname.quad)
# Sort by frequency
unigramDT <- unigramDT[order(freq, decreasing=TRUE),]
bigramDT <- bigramDT[order(freq, decreasing=TRUE),]
trigramDT <- trigramDT[order(freq, decreasing=TRUE),]
quadgramDT <- quadgramDT[order(freq, decreasing=TRUE),]
# Save the n-gram data tables to an RData file
save(unigramDT, bigramDT, trigramDT, quadgramDT, file = 'data/ngrams.RData')
} else {load('data/ngrams.RData')}
# Output data
ngrams <- list(unigramDT, bigramDT, trigramDT, quadgramDT)
# Return output data
return(ngrams)
}
ngrams <- ngramextract()
The generated n-grams are saved in the data folder and can be loaded quickly when the ngramextract function is called. If the ngrams.RData file is not found in the data folder, the function goes on to create the n-grams, which can be time-consuming and resource-intensive.
# Load the ngrams dataframes
load('data/ngrams.RData')
# Sort the ngrams
unigramDF <- unigramDT[order(unigramDT$freq, decreasing=T),]
bigramDF <- bigramDT[order(bigramDT$freq, decreasing=T),]
trigramDF <- trigramDT[order(trigramDT$freq, decreasing=T),]
quadgramDF <- quadgramDT[order(quadgramDT$freq, decreasing=T),]
# Number of nGrams to plot
n <- 30
# Get the highest 30 numbers of nGrams
unigram.n30 <- unigramDF[1:n, ]
bigram.n30 <- bigramDF[1:n, ]
trigram.n30 <- trigramDF[1:n, ]
quadgram.n30 <- quadgramDF[1:n, ]
The histograms below indicate the frequencies of the top 30 n-grams. This gives a rough idea of the kinds of words and n-grams stored in the corpus.
# Load needed libraries
library(ggplot2)
# Unigram histogram
ggplot(unigram.n30, aes(x = unigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Unigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bigram histogram
ggplot(bigram.n30, aes(x = bigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Bigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Trigram histogram
ggplot(trigram.n30, aes(x = trigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Trigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Quadgram histogram
ggplot(quadgram.n30, aes(x = quadgram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Quadgrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Load library
library(wordcloud)
# Load ngram frequency data
load("Data/ngramsfreq.RData")
# Combine multiple wordclouds into one graph
par(mfrow=c(1,4))
# Max words for the wordcloud
maxword <- 30; xlim <- 0.5; ylim <- 1.2
# Unigram word cloud
wc1 <- wordcloud(names(freq.uni), freq.uni, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Unigrams Word Cloud")
# Bigram word cloud
wc2 <- wordcloud(names(freq.bi), freq.bi, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Bigrams Word Cloud")
# Trigram word cloud
wc3 <- wordcloud(names(freq.tri), freq.tri, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Trigrams Word Cloud")
# Quadgram word cloud
wc4 <- wordcloud(names(freq.quad), freq.quad, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Quadgrams Word Cloud")
This is the most important step in the process, whereby the prediction model is developed. The prediction model identifies candidate next words from the appropriate n-grams, assigns them a weighted score and predicts the required number of next words based on the user's specification. To favour longer contexts, weights are assigned in decreasing order from quadgram to trigram, bigram and unigram, which means that a next word found in the quadgram table carries a higher weight than one found only in the remaining three n-grams. The weights assigned are largely arbitrary, and further research is needed to determine the relative weights that yield higher accuracy.
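As a hypothetical worked example of the weighting (the counts are made up, but the weights match those set in the prediction function below):
# A candidate accounting for 20% of the matching quadgram continuations beats
# one accounting for 40% of the matching bigram continuations, because the
# quadgram weight (1.0) is larger than the bigram weight (0.4)
0.20 * 1.0   # quadgram candidate score = 0.20
0.40 * 0.4   # bigram candidate score   = 0.16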
The input text is first cleansed before the next word is predicted. The cleansing includes removing punctuation, converting the characters to lower case, removing numbers, removing the English stop words, etc. The cleansed text is returned as either a character vector or a string, depending on where the function is used. The string output is used in the “Shiny App”, where the cleansed input text is shown; the character vector is used within the prediction function.
#====================================
# Cleanse Input function
#====================================
textcleanse <- function(input, type){
# type of output, vector or string
# v = character vector, s = string
if (type=="s") {
input <- tolower(input)
input <- removePunctuation(input)
input <- removeNumbers(input)
input <- unlist(strsplit(input, " "))
input <- setdiff(input, stopwords(kind = "en"))
input <- input[grepl('[[:alpha:]]', input)]
input <- paste(input, sep=" ", collapse=" ")
} else {
input <- tolower(input)
input <- removePunctuation(input)
input <- removeNumbers(input)
input <- rev(unlist(strsplit(input, " ")))
input <- setdiff(input, stopwords(kind = "en"))
input <- input[grepl('[[:alpha:]]', input)]
}
return(input)
}
#====================================
# Predict next word function
#====================================
predictnextword <- function( input, maxwords ) {
#-----------------------------
# Constants
#-----------------------------
# ngram weights
quadgramwt <- 1
trigramwt <- 0.6
bigramwt <- 0.4
unigramwt <- 0.16
#-----------------------------
# Load ngrams data
#-----------------------------
# Load ngrams
load("data/ngrams.RData")
# Cleanse the input text
input <- textcleanse(input, "v")
# Inputs for searching the quadgram, trigram and bigram tables
input.tri <- paste(input[3], input[2], input[1], sep = ' ')
input.bi <- paste(input[2], input[1], sep = ' ')
input.uni <- input[1]
#-----------------------------
# Searching next word in ngrams
#-----------------------------
quadgramsearch <- grepl(paste0("^", input.tri, "$"), quadgramDT$trigram)
quadgramsubset <- quadgramDT[quadgramsearch,]
if (sum(quadgramsearch) == 0) {
quadgramsubset <- quadgramDT[1,]
quadgramsubset$nextword <- "n.a."}
trigramsearch <- grepl(paste0("^", input.bi, "$"), trigramDT$bigram)
trigramsubset <- trigramDT[trigramsearch,]
if (sum(trigramsearch) == 0) {
trigramsubset <- trigramDT[1,]
trigramsubset$nextword <- "n.a."}
bigramsearch <- grepl(paste0("^",input.uni,"$"), bigramDT$unigram)
bigramsubset <- bigramDT[bigramsearch,]
if (sum(bigramsearch) == 0) {
bigramsubset <- bigramDT[1,]
bigramsubset$nextword <- "n.a."}
unigramDT <- unigramDT[order(unigramDT$freq, decreasing = T),]
unigramsubset <- unigramDT[1:maxwords,]
# Compute probability scores for the ngrams
unigramsubset$score <- unigramsubset$freq / sum(unigramDT$freq) * unigramwt
bigramsubset$score <- bigramsubset$freq / sum(bigramsubset$freq) * bigramwt
trigramsubset$score <- trigramsubset$freq / sum(trigramsubset$freq) * trigramwt
quadgramsubset$score <- quadgramsubset$freq / sum(quadgramsubset$freq) * quadgramwt
#-----------------------------
# Predict the word based on ngram results
#-----------------------------
# Determine the predicted words and their score
# predicted word for the dataframe
nextword <- c(quadgramsubset$nextword, trigramsubset$nextword,
bigramsubset$nextword, unigramsubset$unigram)
# scores for the predicted words
score <- c(quadgramsubset$score, trigramsubset$score,
bigramsubset$score, unigramsubset$score)
# Build the predicted word dataframe
predictedword <- data.frame(next_word = nextword, score = score, stringsAsFactors = F)
# Remove n.a.
predictedword <- subset(predictedword, next_word != "n.a.")
#-----------------------------
# Final processing of predicted words
#-----------------------------
# Summarize the dataframe
suppressWarnings(suppressMessages(library(dplyr)))
predictedword <- predictedword %>%
group_by(next_word) %>%
summarize(score=sum(score))
# Sort the dataframe using the score
predictedword <- predictedword[order(predictedword$score, decreasing = T),]
# Remove duplicates
predictedword <- unique(predictedword$next_word)
# Retain only the max results
predictedword <- predictedword[1:maxwords]
# Remove spaces
predictedword <- predictedword[grepl('[[:alpha:]]',predictedword)]
output <- paste(predictedword, sep=",", collapse=", ")
#-----------------------------
# Return final outcome
#-----------------------------
return(output)
}
# Testing 1
inputtext <- "mg cholesterol mg sodium"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: g, said, will
# Testing 2
inputtext <- "president barack obama"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: said, administration, will
# Testing 3
inputtext <- "two years"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: ago, old, later
# Testing 4
inputtext <- "even"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: though, know, better
# Testing 5
inputtext <- ""
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: said, will, one
Based on the next word prediction, a Shiny app is deployed. The app takes an input string and the number of words to be predicted, then predicts the next word(s) using the back-off algorithm over the n-grams. The algorithm attaches higher weights to predictions from higher-order n-grams to improve accuracy; however, the weights attached to the n-grams will need further testing to determine values that produce higher accuracy.
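A minimal sketch of how such a Shiny app might wire the prediction function to the user interface is shown below (illustrative only; the deployed app's actual layout, input names and code may differ):
# Minimal illustrative Shiny sketch (not the deployed app's exact code)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("usertext", "Enter a phrase:", value = ""),
  numericInput("nwords", "Number of words to predict:", value = 3, min = 1, max = 10),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # predictnextword() is the prediction function defined earlier in this report
    predictnextword(input$usertext, input$nwords)
  })
}

shinyApp(ui = ui, server = server)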