Introduction

The purpose of this project is to develop an NLP model that simulates SwiftKey-style next-word prediction. For this we have three sources of text to work with: Twitter, news, and blog data. To build a successful NLP model we will work in several stages.

In this report we will focus only on the Preprocessing and Exploratory analysis of the data.

Preprocessing and Cleaning Data

Installing the required packages

required.packages <- c("ggplot2", "quanteda", "readr", "stringr", "kableExtra")

# Install any missing packages, then attach each one
for (pakg in required.packages) {
  if (!(pakg %in% rownames(installed.packages()))) {
    install.packages(pakg)
  }
  require(pakg, character.only = TRUE)
}

There are three sources of textual data provided at the link here. After unzipping the file we get three text files containing the Twitter, news, and blog data.

Loading the data

# Reading the Twitter data from the SwiftKey corpus
twitterData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")


# Reading the news data from the SwiftKey corpus
newsData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.news.txt")


# Reading the blog data from the SwiftKey corpus
blogData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")

Take a Peek

After successfully loading the data into the R session, we take a cursory glance at the data.

          Words       Characters   Lines      Memory (bytes)
News      35,624,454  203,223,159  1,010,242  269,840,992
Blogs     38,309,620  206,824,505  899,288    267,758,632
Twitter   31,003,545  162,096,248  2,360,150  334,485,064
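
These summary figures could be reproduced with a short helper along the following lines. This is only a sketch: summarise_text is a hypothetical function, and the memory column is assumed to come from object.size(), reported in bytes.

# Hypothetical helper: summarise a character vector of text lines
summarise_text <- function(x) {
  data.frame(
    Words      = sum(str_count(x, boundary("word"))),  # total word count (stringr)
    Characters = sum(nchar(x)),                        # total character count
    Lines      = length(x),                            # number of lines
    Memory     = as.numeric(object.size(x))            # memory footprint in bytes
  )
}

rbind(News    = summarise_text(newsData),
      Blogs   = summarise_text(blogData),
      Twitter = summarise_text(twitterData))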

As we can see, the amount of data to be processed is huge, so we take a sample (5,000 lines from each source) from the raw data to create our corpus.

Creating sample corpus

# Creating sample data from Twitter, News and blogs

nlines <- 5000

tweets <- sample(twitterData, nlines, replace = FALSE)
news <- sample(newsData, nlines, replace = FALSE)
blogs <- sample(blogData, nlines, replace = FALSE)

# Removing the original datasets from memory to free space

rm(twitterData, newsData, blogData)
gc()
##           used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 2031050 108.5    7324825  391.2   7143677  381.6
## Vcells 8042456  61.4  170555061 1301.3 213061023 1625.6

Let's explore the sampled data and see whether we can work with this corpus.

          Words    Characters  Lines  Memory (bytes)
News      176,461  1,006,025   5,000  1,335,680
Blogs     209,057  1,127,267   5,000  1,465,464
Twitter   65,008   340,709     5,000  714,856

The sampled corpus is of a manageable size and consumes far less memory to process. We then combine the three sources into a single corpus from which we will create clean 1-, 2-, and 3-gram tokens.

Cleaning the Corpus & Creating Tokens with quanteda

To create a tokens object and clean the tokens we will use R's quanteda package instead of the more popular tm and tidytext packages for text mining. The quanteda package supports multi-threading and is generally faster than the tm package.
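
Because quanteda is multi-threaded, the number of worker threads can be set up front; a minimal sketch using quanteda's quanteda_options() (the value of 4 is only an illustration, not part of the original analysis):

library(quanteda)

# Let quanteda use several threads for tokenization and dfm construction
# (4 is an arbitrary choice; adjust to the machine's available cores)
quanteda_options(threads = 4)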

The text cleaning performed below removes numbers, punctuation, symbols, hyphens, separators, Twitter characters, URLs, and profane words, and lower-cases every token.

I have not removed stop words, nor stemmed the words, because we will be comparing user input (as typed) against the most frequently used tokens. Removing or stemming them could lose information and lead to incorrect word predictions.

## Creating clean tokens

# Combining the sampled data sets into a single corpus
sample.corpus <- as.character(c(tweets, news, blogs))

# Tokenize the combined corpus, removing numbers, punctuation, symbols,
# hyphens, separators, Twitter characters and URLs
sample.tokens <- tokens(sample.corpus, what = "word", 
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_hyphens = TRUE,
                       remove_separators = TRUE,remove_twitter = TRUE, 
                       remove_url = TRUE)
                       
# Lower case the tokens.
sample.tokens <- tokens_tolower(sample.tokens)

# Remove profane words
profane.words <- read_lines("./Coursera-SwiftKey/final/en_US/profane.txt")
sample.tokens <- tokens_remove(sample.tokens, profane.words)

As a start, the NLP model we will use for predicting text is the N-gram model, which is based on the concept of a Markov chain. Put simply, we compute the probability of a word following a sentence fragment under the assumption that this probability depends only on the preceding n-1 words. For example, if we choose n=3 (trigram) and the sentence fragment is "I had breakfast this", the probability of the next word being "morning" depends only on the two preceding words, "breakfast" and "this". To deal with the sparsity of text data, we will use Katz's back-off approach together with Good-Turing smoothing.
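For concreteness, the trigram assumption and its maximum-likelihood estimate can be written as follows, where C(.) denotes a count in the corpus (back-off and Good-Turing smoothing then adjust this estimate for unseen n-grams):

$$
P(w_i \mid w_1,\dots,w_{i-1}) \;\approx\; P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})}
$$

In the example above, the estimate of P(morning | breakfast, this) is the count of "breakfast this morning" divided by the count of "breakfast this".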

Create 1-, 2-, and 3-grams from the Tokens

# Creating N-gram tokens
sample.1gram <- tokens_ngrams(sample.tokens, n = 1)
sample.2gram <- tokens_ngrams(sample.tokens, n = 2)
sample.3gram <- tokens_ngrams(sample.tokens, n = 3)

# Create document-feature matrices (bag-of-words models) for each n-gram set
tokens.1gram.dfm <- dfm(sample.1gram, tolower = FALSE)
tokens.2gram.dfm <- dfm(sample.2gram, tolower = FALSE)
tokens.3gram.dfm <- dfm(sample.3gram, tolower = FALSE)


# Convert each dfm into a data frame of n-grams sorted by frequency

dfm_to_df <- function(x) {
     x.df <- data.frame(Content = featnames(x), Frequency = colSums(x), 
                 stringsAsFactors = FALSE)
     x.df <- x.df[with(x.df, order(-Frequency)),]
     row.names(x.df) <- NULL
     return(x.df)
}

tokens.1gram.df <- dfm_to_df(tokens.1gram.dfm)
tokens.2gram.df <- dfm_to_df(tokens.2gram.dfm)
tokens.3gram.df <- dfm_to_df(tokens.3gram.dfm)

Exploring the Corpus

The top 10 most frequently used 1-, 2-, and 3-gram tokens are tabulated below.

1-gram frequencies

Content  Frequency
the      21633
to       12061
and      11290
a        10715
of       9396
in       7387
i        6492
that     4654
for      4545
is       4384

2-gram frequencies

Content   Frequency
of_the    2053
in_the    1836
to_the    1030
on_the    946
for_the   790
to_be     694
and_the   591
at_the    588
in_a      547
with_the  496

3-gram frequencies

Content      Frequency
one_of_the   172
a_lot_of     159
out_of_the   86
to_be_a      76
it_was_a     71
some_of_the  70
going_to_be  69
be_able_to   60
the_end_of   57
as_well_as   56

We will now create bar plots of the 50 most frequently occurring tokens in the 1-, 2-, and 3-gram models.

Top 50 1-gram words

Top 50 2-gram words

Top 50 3-gram words
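
Such plots can be produced directly from the frequency data frames built earlier; here is a minimal sketch using ggplot2 (plot_top_tokens is a hypothetical helper, not part of the original analysis):

library(ggplot2)

# Hypothetical helper: horizontal bar plot of the n most frequent tokens
plot_top_tokens <- function(df, n = 50, title = "Top tokens") {
  top <- head(df, n)   # df is already sorted by descending frequency
  ggplot(top, aes(x = reorder(Content, Frequency), y = Frequency)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = "Token", y = "Frequency")
}

plot_top_tokens(tokens.1gram.df, 50, "Top 50 1-gram words")
plot_top_tokens(tokens.2gram.df, 50, "Top 50 2-gram words")
plot_top_tokens(tokens.3gram.df, 50, "Top 50 3-gram words")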

Next Steps

After exploring the corpus and looking at the most frequent 1-, 2-, and 3-gram tokens, we notice many tokens with very low frequency that add little to the model. Such low-frequency tokens can be removed from the N-gram models to save memory and improve performance, as sketched below.
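
One way to prune is either to drop low-frequency rows from the frequency data frames or to trim the dfm objects before converting them. A rough sketch follows; the cutoff of 1 is only an illustration, and min_termfreq is the argument name in recent quanteda versions (older versions use min_count).

# Drop n-grams that occur only once (cutoff chosen for illustration)
tokens.3gram.df.trimmed <- subset(tokens.3gram.df, Frequency > 1)

# Equivalently, trim the dfm before converting it to a data frame
tokens.3gram.dfm.trimmed <- dfm_trim(tokens.3gram.dfm, min_termfreq = 2)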

Predicting the Next Word

After successfully cleaning the corpus and building the N-grams, a next-word prediction model will be developed based on the Katz back-off algorithm. A rough sketch of the back-off lookup is given below.
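
As a rough illustration only (this applies no discounting or Good-Turing smoothing, so it is not the full Katz model), the frequency data frames built above could be queried like this; predict_next_word is a hypothetical helper:

# Hypothetical sketch: back off from the 3-gram to the 2-gram to the 1-gram table
predict_next_word <- function(w1, w2) {
  # Try the trigram table first: entries of the form "w1_w2_*"
  hits <- tokens.3gram.df[startsWith(tokens.3gram.df$Content,
                                     paste(w1, w2, "", sep = "_")), ]
  if (nrow(hits) > 0) {
    return(strsplit(hits$Content[1], "_", fixed = TRUE)[[1]][3])
  }
  # Back off to the bigram table: entries of the form "w2_*"
  hits <- tokens.2gram.df[startsWith(tokens.2gram.df$Content,
                                     paste(w2, "", sep = "_")), ]
  if (nrow(hits) > 0) {
    return(strsplit(hits$Content[1], "_", fixed = TRUE)[[1]][2])
  }
  # Fall back to the single most frequent unigram
  tokens.1gram.df$Content[1]
}

predict_next_word("one", "of")   # should return "the" given the 3-gram table above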