The purpose of this project is to develop an NLP model that can simulate SwiftKey-style word prediction. For this purpose we have 3 sources of text corpus to work with, viz. Twitter, news and blogs. To build a successful NLP model we will work in several stages.
In this report we will focus only on the Preprocessing and Exploratory analysis of the data.
Installing the required packages
required.packages <- c("ggplot2", "quanteda", "readr", "stringr", "kableExtra")
for (pkg in required.packages) {
  # Install the package first if it is not already available, then load it
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
  require(pkg, character.only = TRUE)
}
There are 3 sources of textual data provided in the link here. After unzipping the file we get three text files containing Twitter, news and blog data.
Loading the data
# Reading the Twitter data from the SwiftKey corpus
twitterData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
# Reading the news data from the SwiftKey corpus
newsData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
# Reading the blog data from the SwiftKey corpus
blogData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
Take a Peek
After successfully loading the data into the R session, we take a cursory glance at the data.
| | Words | Characters | Lines | Memory (bytes) |
|---|---|---|---|---|
| News | 35624454 | 203223159 | 1010242 | 269840992 |
| Blogs | 38309620 | 206824505 | 899288 | 267758632 |
| Twitter | 31003545 | 162096248 | 2360150 | 334485064 |
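These figures can be gathered along the following lines (a minimal sketch: the helper summarize_text() and the whitespace-based word count are my own illustration, so they may not reproduce the table above exactly).
# Sketch of how the summary statistics above can be computed; the helper
# summarize_text() and the whitespace-based word count are illustrative only.
summarize_text <- function(x) {
  data.frame(
    Words      = sum(str_count(x, "\\S+")),  # whitespace-delimited word count
    Characters = sum(nchar(x)),              # total character count
    Lines      = length(x),                  # number of lines read
    Memory     = as.numeric(object.size(x))  # memory footprint in bytes
  )
}
rbind(
  News    = summarize_text(newsData),
  Blogs   = summarize_text(blogData),
  Twitter = summarize_text(twitterData)
)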
As we can see, the amount of data to be processed is huge, so we take a sample of about 5,000 lines from each source to create our corpus.
Creating sample corpus
# Creating sample data from Twitter, News and blogs
nlines <- 5000
tweets <- sample(twitterData, nlines, replace = FALSE)
news <- sample(newsData, nlines, replace = FALSE)
blogs <- sample(blogData, nlines, replace = FALSE)
# Removing the original datasets from the memory
rm(twitterData, newsData, blogData)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2031050 108.5 7324825 391.2 7143677 381.6
## Vcells 8042456 61.4 170555061 1301.3 213061023 1625.6
Let's explore the sampled data and see whether we can work with this corpus.
| | Words | Characters | Lines | Memory (bytes) |
|---|---|---|---|---|
| News | 176461 | 1006025 | 5000 | 1335680 |
| Blogs | 209057 | 1127267 | 5000 | 1465464 |
| Twitter | 65008 | 340709 | 5000 | 714856 |
The sampled corpus is of a manageable size and consumes far less memory to process. We then combine all three sources into a single corpus from which we will create clean 1-, 2- and 3-gram tokens.
Cleaning the Corpus & Creating Tokens Using quanteda
To create a token object and clean the tokens we will use R’s quanteda package rather than the more popular tm and tidytext packages for text mining. The quanteda package supports multi-threading and is generally faster than the tm package.
The following text cleaning is performed in the code below: numbers, punctuation, symbols, hyphens, separators, Twitter handles, URLs and profane words are removed, and all tokens are lower-cased.
I have not removed stop words or applied stemming, since we will be comparing user inputs as-is with the most frequently used tokens; doing so might lose information and lead to incorrect word predictions.
## Creating clean tokens
# Combining the sampled data sets into a single corpus
sample.corpus <- as.character(c(tweets, news, blogs))
# Tokenize the combined corpus, stripping numbers, punctuation, symbols,
# hyphens, separators, Twitter handles and URLs
sample.tokens <- tokens(sample.corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE,
remove_separators = TRUE, remove_twitter = TRUE,
remove_url = TRUE)
# Lower case the tokens.
sample.tokens <- tokens_tolower(sample.tokens)
# Remove profane words
profane.words <- read_lines("./Coursera-SwiftKey/final/en_US/profane.txt")
sample.tokens <- tokens_remove(sample.tokens, profane.words)
As a start, the NLP model we will use for predicting text is the N-gram model, which is based on the concept of a Markov chain. Put in simpler terms, we will compute the probability of a word following a sentence fragment under the assumption that this probability depends only on the preceding n-1 words. For example, if we choose n=3 (tri-gram) and the sentence fragment is “I had breakfast this”, the probability of the next word being “morning” depends only on the two preceding words, “breakfast” and “this”. To deal with the sparsity of text data, we will use Katz’s back-off approach as well as Good-Turing smoothing.
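With the trigram assumption this estimate reduces to a ratio of counts. A quick illustration with made-up counts (the real counts come from the n-gram frequency tables built below):
# Hypothetical counts, purely to illustrate the trigram (Markov) estimate;
# the real counts come from the n-gram frequency tables built below.
count.breakfast.this.morning <- 12  # occurrences of "breakfast this morning"
count.breakfast.this <- 20          # occurrences of "breakfast this"
# Maximum-likelihood estimate of P(morning | breakfast, this)
count.breakfast.this.morning / count.breakfast.this
## [1] 0.6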
Create 1, 2 and 3-grams from the Tokens
# Creating N-gram tokens
sample.1gram <- tokens_ngrams(sample.tokens, n = 1)
sample.2gram <- tokens_ngrams(sample.tokens, n = 2)
sample.3gram <- tokens_ngrams(sample.tokens, n = 3)
# Create bag-of-words models (document-feature matrices) for each n-gram set
tokens.1gram.dfm <- dfm(sample.1gram, tolower = FALSE)
tokens.2gram.dfm <- dfm(sample.2gram, tolower = FALSE)
tokens.3gram.dfm <- dfm(sample.3gram, tolower = FALSE)
# Converting into data frame
dfm_to_df <- function(x) {
x.df <- data.frame(Content = featnames(x), Frequency = colSums(x),
stringsAsFactors = FALSE)
x.df <- x.df[with(x.df, order(-Frequency)),]
row.names(x.df) <- NULL
return(x.df)
}
tokens.1gram.df <- dfm_to_df(tokens.1gram.dfm)
tokens.2gram.df <- dfm_to_df(tokens.2gram.dfm)
tokens.3gram.df <- dfm_to_df(tokens.3gram.dfm)
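The tables below list the first ten rows of each frequency data frame, for example as follows (the knitr::kable call is an assumption about how the report renders them; kableExtra could style them further):
# Top 10 most frequent 1-, 2- and 3-grams (rendered as the tables below)
knitr::kable(head(tokens.1gram.df, 10))
knitr::kable(head(tokens.2gram.df, 10))
knitr::kable(head(tokens.3gram.df, 10))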
The top 10 most frequently used 1-, 2- and 3-gram tokens are tabled below.
| Content | Frequency |
|---|---|
| the | 21633 |
| to | 12061 |
| and | 11290 |
| a | 10715 |
| of | 9396 |
| in | 7387 |
| i | 6492 |
| that | 4654 |
| for | 4545 |
| is | 4384 |
| Content | Frequency |
|---|---|
| of_the | 2053 |
| in_the | 1836 |
| to_the | 1030 |
| on_the | 946 |
| for_the | 790 |
| to_be | 694 |
| and_the | 591 |
| at_the | 588 |
| in_a | 547 |
| with_the | 496 |
| Content | Frequency |
|---|---|
| one_of_the | 172 |
| a_lot_of | 159 |
| out_of_the | 86 |
| to_be_a | 76 |
| it_was_a | 71 |
| some_of_the | 70 |
| going_to_be | 69 |
| be_able_to | 60 |
| the_end_of | 57 |
| as_well_as | 56 |
We will now create bar plots of the 50 most frequently occurring tokens in the 1-, 2- and 3-gram models.
1. Top 50 1-gram words
2. Top 50 2-gram words
3. Top 50 3-gram words
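The plots themselves are not reproduced in this text version; a minimal ggplot2 sketch of how each one can be produced (shown for the 1-gram table, the other two are analogous):
# Bar plot of the 50 most frequent 1-grams; the 2- and 3-gram plots are analogous
top50 <- head(tokens.1gram.df, 50)
ggplot(top50, aes(x = reorder(Content, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep the token labels readable
  labs(title = "Top 50 1-gram words", x = "Token", y = "Frequency")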
After exploring the corpus and looking at the most frequent 1-, 2- and 3-gram tokens, we notice that there are many tokens with very low frequency that add little significance to the model. Such low-frequency tokens can be removed from the 1-, 2- and 3-gram models to save memory and improve performance, for example as sketched below.
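A minimal sketch of such pruning, where the frequency cut-off of 1 is illustrative and would need to be tuned:
# Drop n-grams that occur only once; the cut-off of 1 is illustrative
min.freq <- 1
tokens.1gram.df <- tokens.1gram.df[tokens.1gram.df$Frequency > min.freq, ]
tokens.2gram.df <- tokens.2gram.df[tokens.2gram.df$Frequency > min.freq, ]
tokens.3gram.df <- tokens.3gram.df[tokens.3gram.df$Frequency > min.freq, ]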
After successfully cleaning the data and building the N-grams, a next-word prediction model will be developed based on the Katz back-off algorithm. Here are the steps to implement the back-off model.
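As a rough illustration of where those steps lead, here is a minimal sketch of a back-off lookup over the frequency tables built above. The helper predict_next_word() is my own illustration: it simply falls back from the 3-gram table to the 2-gram table to the most frequent 1-gram by raw frequency, and does not implement Katz’s discounting.
# Illustrative back-off lookup: try the 3-gram table first, then the 2-gram
# table, then fall back to the most frequent 1-gram. No discounting is applied,
# so this is only a skeleton of the full Katz back-off model.
predict_next_word <- function(prev.words) {
  prev.words <- tolower(prev.words)
  n <- length(prev.words)
  if (n >= 2) {
    # Look for 3-grams starting with the last two words, e.g. "breakfast_this_"
    prefix <- paste0(paste(tail(prev.words, 2), collapse = "_"), "_")
    matches <- tokens.3gram.df[startsWith(tokens.3gram.df$Content, prefix), ]
    if (nrow(matches) > 0) {
      return(sub(".*_", "", matches$Content[1]))  # last word of the best match
    }
  }
  if (n >= 1) {
    # Back off to 2-grams starting with the last word
    prefix <- paste0(tail(prev.words, 1), "_")
    matches <- tokens.2gram.df[startsWith(tokens.2gram.df$Content, prefix), ]
    if (nrow(matches) > 0) {
      return(sub(".*_", "", matches$Content[1]))
    }
  }
  # Final fallback: the single most frequent 1-gram
  tokens.1gram.df$Content[1]
}
predict_next_word(c("i", "had", "breakfast", "this"))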