The purpose of this project is to develop an NLP model that can simulate SwiftKey-style word prediction. For this purpose we have 3 sources of text corpus to work with, viz. Twitter, news and blogs. To build a successful NLP model we will work in several stages.
In this report we will focus only on the Preprocessing and Exploratory analysis of the data.
Installing the required packages
required.packages <- c("ggplot2", "quanteda", "readr", "stringr", "kableExtra")
for (pkg in required.packages) {
  # Install the package first if it is not already available, then load it
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
  require(pkg, character.only = TRUE)
}
There are 3 sources of textual data provided in the link here. After unzipping the file we get three text files containing Twitter, news and blog data.
Loading the data
# Reading the Twitter data from the SwiftKey corpus
twitterData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
# Reading the news data from the SwiftKey corpus
newsData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
# Reading the blog data from the SwiftKey corpus
blogData <- read_lines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
Take a Peek
After successfully loading the data into the R session, we take a cursory glance at the data.
| | Words | Characters | Lines | Memory (bytes) |
|---|---|---|---|---|
| News | 35624454 | 203223159 | 1010242 | 269840992 |
| Blogs | 38309620 | 206824505 | 899288 | 267758632 |
| Twitter | 31003545 | 162096248 | 2360150 | 334485064 |
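These figures can be gathered along the following lines (a minimal sketch: the helper summarize_text() and the whitespace-based word count are my own illustration, so they may not reproduce the table above exactly).
# Sketch of how the summary statistics above can be computed; the helper
# summarize_text() and the whitespace-based word count are illustrative only.
summarize_text <- function(x) {
  data.frame(
    Words      = sum(str_count(x, "\\S+")),  # whitespace-delimited word count
    Characters = sum(nchar(x)),              # total character count
    Lines      = length(x),                  # number of lines read
    Memory     = as.numeric(object.size(x))  # memory footprint in bytes
  )
}
rbind(
  News    = summarize_text(newsData),
  Blogs   = summarize_text(blogData),
  Twitter = summarize_text(twitterData)
)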
As we can see, the amount of data to be processed is huge, so we take a sample of about 5,000 lines from each source to create our corpus.
Creating sample corpus
# Creating sample data from Twitter, News and blogs
nlines <- 5000
tweets <- sample(twitterData, nlines, replace = FALSE)
news <- sample(newsData, nlines, replace = FALSE)
blogs <- sample(blogData, nlines, replace = FALSE)
# Removing the original datasets from the memory
rm(twitterData, newsData, blogData)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2031050 108.5 7324825 391.2 7143677 381.6
## Vcells 8042456 61.4 170555061 1301.3 213061023 1625.6
Let's explore the sampled data and see whether we can work with this corpus.
| | Words | Characters | Lines | Memory (bytes) |
|---|---|---|---|---|
| News | 176461 | 1006025 | 5000 | 1335680 |
| Blogs | 209057 | 1127267 | 5000 | 1465464 |
| Twitter | 65008 | 340709 | 5000 | 714856 |
The sampled corpus is of a manageable size and consumes far less memory to process. We then combine all three sources into a single corpus from which we will create clean 1-, 2- and 3-gram tokens.
Cleaning the Corpus & Creating Tokens Using quanteda
To create a token object and clean the tokens we will use R’s quanteda package rather than the more popular tm and tidytext packages for text mining. The quanteda package supports multi-threading and is generally faster than the tm package.
The following text cleaning is performed in the code below: numbers, punctuation, symbols, hyphens, separators, Twitter handles, URLs and profane words are removed, and all tokens are lower-cased.
I have not removed stop words or applied stemming, since we will be comparing user inputs as-is with the most frequently used tokens; doing so might lose information and lead to incorrect word predictions.
## Creating clean tokens
# Combining the sampled data sets into a single corpus
sample.corpus <- as.character(c(tweets, news, blogs))
# Tokenize the combined corpus, stripping numbers, punctuation, symbols,
# hyphens, separators, Twitter handles and URLs
sample.tokens <- tokens(sample.corpus, what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE,
remove_separators = TRUE, remove_twitter = TRUE,
remove_url = TRUE)
# Lower case the tokens.
sample.tokens <- tokens_tolower(sample.tokens)
# Remove profane words
profane.words <- read_lines("./Coursera-SwiftKey/final/en_US/profane.txt")
sample.tokens <- tokens_remove(sample.tokens, profane.words)
As a start, the NLP model we will use for predicting text is the N-gram model, which is based on the concept of a Markov chain. Put in simpler terms, we will compute the probability of a word following a sentence fragment under the assumption that this probability depends only on the preceding n-1 words. For example, if we choose n=3 (tri-gram) and the sentence fragment is “I had breakfast this”, the probability of the next word being “morning” depends only on the two preceding words, “breakfast” and “this”. To deal with the sparsity of text data, we will use Katz’s back-off approach as well as Good-Turing smoothing.
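With the trigram assumption this estimate reduces to a ratio of counts. A quick illustration with made-up counts (the real counts come from the n-gram frequency tables built below):
# Hypothetical counts, purely to illustrate the trigram (Markov) estimate;
# the real counts come from the n-gram frequency tables built below.
count.breakfast.this.morning <- 12  # occurrences of "breakfast this morning"
count.breakfast.this <- 20          # occurrences of "breakfast this"
# Maximum-likelihood estimate of P(morning | breakfast, this)
count.breakfast.this.morning / count.breakfast.this
## [1] 0.6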
Create 1, 2 and 3-grams from the Tokens
# Creating N-gram tokens
sample.1gram <- tokens_ngrams(sample.tokens, n = 1)
sample.2gram <- tokens_ngrams(sample.tokens, n = 2)
sample.3gram <- tokens_ngrams(sample.tokens, n = 3)
# Create bag-of-words models (document-feature matrices) for each n-gram set
tokens.1gram.dfm <- dfm(sample.1gram, tolower = FALSE)
tokens.2gram.dfm <- dfm(sample.2gram, tolower = FALSE)
tokens.3gram.dfm <- dfm(sample.3gram, tolower = FALSE)
# Converting into data frame
dfm_to_df <- function(x) {
x.df <- data.frame(Content = featnames(x), Frequency = colSums(x),
stringsAsFactors = FALSE)
x.df <- x.df[with(x.df, order(-Frequency)),]
row.names(x.df) <- NULL
return(x.df)
}
tokens.1gram.df <- dfm_to_df(tokens.1gram.dfm)
tokens.2gram.df <- dfm_to_df(tokens.2gram.dfm)
tokens.3gram.df <- dfm_to_df(tokens.3gram.dfm)
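The tables below list the first ten rows of each frequency data frame, for example as follows (the knitr::kable call is an assumption about how the report renders them; kableExtra could style them further):
# Top 10 most frequent 1-, 2- and 3-grams (rendered as the tables below)
knitr::kable(head(tokens.1gram.df, 10))
knitr::kable(head(tokens.2gram.df, 10))
knitr::kable(head(tokens.3gram.df, 10))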
The top 10 most frequently used 1-, 2- and 3-gram tokens are tabled below.
| Content | Frequency |
|---|---|
| the | 21633 |
| to | 12061 |
| and | 11290 |
| a | 10715 |
| of | 9396 |
| in | 7387 |
| i | 6492 |
| that | 4654 |
| for | 4545 |
| is | 4384 |
| Content | Frequency |
|---|---|
| of_the | 2053 |
| in_the | 1836 |
| to_the | 1030 |
| on_the | 946 |
| for_the | 790 |
| to_be | 694 |
| and_the | 591 |
| at_the | 588 |
| in_a | 547 |
| with_the | 496 |
| Content | Frequency |
|---|---|
| one_of_the | 172 |
| a_lot_of | 159 |
| out_of_the | 86 |
| to_be_a | 76 |
| it_was_a | 71 |
| some_of_the | 70 |
| going_to_be | 69 |
| be_able_to | 60 |
| the_end_of | 57 |
| as_well_as | 56 |
We will now create bar plots of the 50 most frequently occurring tokens in the 1-, 2- and 3-gram models.
1. Top 50 1-gram words
2. Top 50 2-gram words
3. Top 50 3-gram words
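The plots themselves are not reproduced in this text version; a minimal ggplot2 sketch of how each one can be produced (shown for the 1-gram table, the other two are analogous):
# Bar plot of the 50 most frequent 1-grams; the 2- and 3-gram plots are analogous
top50 <- head(tokens.1gram.df, 50)
ggplot(top50, aes(x = reorder(Content, Frequency), y = Frequency)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep the token labels readable
  labs(title = "Top 50 1-gram words", x = "Token", y = "Frequency")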
After exploring the corpus and looking at the most frequent 1-, 2- and 3-gram tokens, we notice that there are many tokens with very low frequency that add little significance to the model. Such low-frequency tokens can be removed from the 1-, 2- and 3-gram models to save memory and improve performance, for example as sketched below.
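A minimal sketch of such pruning, where the frequency cut-off of 1 is illustrative and would need to be tuned:
# Drop n-grams that occur only once; the cut-off of 1 is illustrative
min.freq <- 1
tokens.1gram.df <- tokens.1gram.df[tokens.1gram.df$Frequency > min.freq, ]
tokens.2gram.df <- tokens.2gram.df[tokens.2gram.df$Frequency > min.freq, ]
tokens.3gram.df <- tokens.3gram.df[tokens.3gram.df$Frequency > min.freq, ]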
After successfully cleaning the data and building the N-grams, a next-word prediction model will be developed based on the Katz back-off algorithm. Here are the steps to implement the back-off model.
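As a rough illustration of where those steps lead, here is a minimal sketch of a back-off lookup over the frequency tables built above. The helper predict_next_word() is my own illustration: it simply falls back from the 3-gram table to the 2-gram table to the most frequent 1-gram by raw frequency, and does not implement Katz’s discounting.
# Illustrative back-off lookup: try the 3-gram table first, then the 2-gram
# table, then fall back to the most frequent 1-gram. No discounting is applied,
# so this is only a skeleton of the full Katz back-off model.
predict_next_word <- function(prev.words) {
  prev.words <- tolower(prev.words)
  n <- length(prev.words)
  if (n >= 2) {
    # Look for 3-grams starting with the last two words, e.g. "breakfast_this_"
    prefix <- paste0(paste(tail(prev.words, 2), collapse = "_"), "_")
    matches <- tokens.3gram.df[startsWith(tokens.3gram.df$Content, prefix), ]
    if (nrow(matches) > 0) {
      return(sub(".*_", "", matches$Content[1]))  # last word of the best match
    }
  }
  if (n >= 1) {
    # Back off to 2-grams starting with the last word
    prefix <- paste0(tail(prev.words, 1), "_")
    matches <- tokens.2gram.df[startsWith(tokens.2gram.df$Content, prefix), ]
    if (nrow(matches) > 0) {
      return(sub(".*_", "", matches$Content[1]))
    }
  }
  # Final fallback: the single most frequent 1-gram
  tokens.1gram.df$Content[1]
}
predict_next_word(c("i", "had", "breakfast", "this"))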