Introduction

This document presents an analysis of the datasets used to build a text prediction application, in which a model predicts the next word based on a user's input of one to three words. It is a brief report intended to highlight the key ideas and findings for managers who do not have a data science background. The prediction model is built using natural language processing (NLP) and text mining techniques.

The data for this project is obtained from SwiftKey (see the References below for the link) and contains three types of textual data - from Twitter, blogs and news - each available in four languages. For this project the English-language data is used. The dataset is sampled and the sample is processed into a collection of texts known as a corpus. The corpus is cleaned and then manipulated using NLP techniques, including tokenization and calculation of word-occurrence statistics. Finally, an approach for the text prediction model is proposed.

For simplicity, most of the code used to generate this document is not displayed. It can, however, be accessed on Github (see References) for verification purposes.

Data Loading

#read the three English datasets; the news file is opened in binary mode ("rb")
#so that readLines() is not truncated by embedded control characters
datatwitter <- readLines(conn <- file("./data/en_US.twitter.txt", encoding = "UTF-8"))
close(conn)
datablog <- readLines(conn <- file("./data/en_US.blogs.txt", encoding = "UTF-8"))
close(conn)
datanews <- readLines(conn <- file("./data/en_US.news.txt", "rb", encoding = "UTF-8"))
close(conn)

Data Exploration

The file size, object size (in memory) and number of lines of each dataset are quite large, as shown below. The average number of words per line also reflects the nature of each source. Blogs, despite having the fewest lines, contain the most words and have the highest average words per line, since blog posts are not subject to length limits, unlike news articles or tweets. Twitter, with its limit of 140 characters per tweet, naturally has a lower average than blogs and news.

When combined, the datasets total over 4 million lines and more than 102 million words. The combined average of about 24 words per line is pulled down by the Twitter dataset, which has the most lines but the fewest words.

Due to the large data size and computational limitations, the number of unique words could not be obtained for the full dataset. It is, however, calculated for the sample dataset later.

| Dataset  | File Size (MB) | Object Size | Lines     | Number of Words | Mean Words per Line |
|----------|----------------|-------------|-----------|-----------------|---------------------|
| Twitter  | 163.189        | 301.4 Mb    | 2,360,148 | 30,093,369      | 12.75063            |
| Blogs    | 205.234        | 248.5 Mb    | 899,288   | 37,546,246      | 41.75108            |
| News     | 200.988        | 249.6 Mb    | 1,010,242 | 35,010,782      | 34.65584            |
| Combined | NA             | 799.4 Mb    | 4,269,678 | 102,650,397     | 24.04172            |
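
As an illustration, the summary statistics in the table above can be computed along the following lines (a sketch assuming the stringi package; the full code is available on Github):

library(stringi)

#illustrative sketch for the Twitter dataset; the same applies to the others
file.size("./data/en_US.twitter.txt") / 1024^2      # file size in MB
format(object.size(datatwitter), units = "Mb")      # object size in memory
length(datatwitter)                                 # number of lines
sum(stri_count_words(datatwitter))                  # total number of words
mean(stri_count_words(datatwitter))                 # mean words per line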

Data Preprocessing

For the purpose of this project, data cleansing is done after creating a corpus (a collection of text documents) from the sample. Removing empty lines (null values, NAs) or replacing certain values (such as zeros or other integers) is not necessary at this stage, as these do not affect the prediction model.

Sample Creation

Due to computational resource limitations, a sample of 1% of the data is used. The sample is drawn by taking 1% from each dataset and then combining them, rather than sampling directly from the combined set, so that the sample mirrors the proportions of the three sources. A seed value is set to ensure reproducibility.

The sample data is then converted to ASCII so that any unreadable characters are removed.

#set seed for reproducibility
set.seed(1234)

#create sample 1%
sampletwt <- sample(datatwitter,round(0.01*length(datatwitter)))
sampleblog <- sample(datablog,round(0.01*length(datablog)))
samplenews <- sample(datanews,round(0.01*length(datanews)))
allsample <- c(sampletwt,sampleblog,samplenews)

#convert to ASCII to ensure proper encoding; characters that cannot be converted are replaced with their byte codes
sampletwt <- iconv(sampletwt, 'UTF-8', 'ASCII', "byte")
sampleblog <- iconv(sampleblog, 'UTF-8', 'ASCII', "byte")
samplenews <- iconv(samplenews, 'UTF-8', 'ASCII', "byte")
allsample <- iconv(allsample, 'UTF-8', 'ASCII', "byte")

Sample Data Exploration

A simple exploration of the samples, similar to that of the full datasets, is done. The mean number of words per line in each sample is approximately the same as in the corresponding full dataset: Twitter (12.94759 vs 12.75063), blogs (44.99444 vs 41.75108), news (35.73461 vs 34.65584) and the combined sample (25.08905 vs 24.04172).

In terms of unique words, however, each of the three individual samples has a higher percentage of unique words than the combined sample; when the samples are combined, the percentage drops. This is expected, since many words are duplicated across the three datasets.

| Dataset  | Object Size | Lines  | Number of Words | Mean Words per Line | Unique Words | % Unique Words |
|----------|-------------|--------|-----------------|---------------------|--------------|----------------|
| Twitter  | 3.1 Mb      | 23,601 | 305,576         | 12.94759            | 32,889       | 10.76295       |
| Blogs    | 2.6 Mb      | 8,993  | 404,635         | 44.99444            | 34,209       | 8.454286       |
| News     | 2.5 Mb      | 10,102 | 360,991         | 35.73461            | 34,897       | 9.667          |
| Combined | 8.2 Mb      | 42,696 | 1,071,202       | 25.08905            | 67,457       | 6.297318       |
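
For reference, the unique-word counts above can be obtained roughly as follows (a sketch that splits on whitespace after lower-casing; the actual code may differ):

#sketch for the Twitter sample; the same applies to the other samples
words_twt  <- unlist(strsplit(tolower(sampletwt), "\\s+"))
n_unique   <- length(unique(words_twt))
pct_unique <- 100 * n_unique / length(words_twt)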

Sample Data Visualization

Wordclouds of the individual sample datasets and the combined sample dataset are created, displaying the top 50 words in each.

Wordcloud for Tweet sample dataset

Wordcloud for blog sample dataset

Wordcloud for news sample dataset

Wordcloud for combined sample dataset
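
These wordclouds can be generated along the following lines (a sketch assuming the wordcloud and RColorBrewer packages; the exact plotting options may differ):

library(wordcloud)
library(RColorBrewer)

#sketch for the Twitter sample; repeated for the blog, news and combined samples
wordcloud(sampletwt, max.words = 50, random.order = FALSE,
          colors = brewer.pal(12, "Set3"))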

Creating Corpus

The sample dataset is converted to corpus format using the tm package. A corpus is simply a collection of texts; this conversion enables further analysis of the dataset with the package's text mining functions.
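
A minimal sketch of this conversion, assuming the tm package (the exact constructor used may differ):

library(tm)

#each line of the combined sample becomes one document in the corpus
allsample.corpus <- VCorpus(VectorSource(allsample))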

Corpus Cleaning

The following steps are done in cleaning up the corpus:

| Order | Step                               |
|-------|------------------------------------|
| 1     | Converting all words to lower case |
| 2     | Removing punctuation               |
| 3     | Removing numbers                   |
| 4     | Removing profanities               |
| 5     | Removing extra whitespace          |

No stemming (reducing words to their root or singular form, e.g. buyer -> buy, dogs -> dog) is done, as it may affect word prediction. Similarly, stopwords (words that are very common in a language, such as "the", "a", "is") are not removed, since they appear frequently in normal text and removing them would affect the prediction model. The tm calls that would perform these steps are shown for reference below.
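
#NOT applied in this project; shown for reference only
#(stemDocument requires the SnowballC package)
#allsample.corpus <- tm_map(allsample.corpus, stemDocument)
#allsample.corpus <- tm_map(allsample.corpus, removeWords, stopwords("english"))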

Dictionary

The list of profanities is retrieved from a Github repository (listed in the References).
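
A sketch of loading the list into the plist object used below (the file name here is illustrative; the list itself is downloaded from the Shutterstock repository in the References):

#illustrative file name; the downloaded profanity list is read into plist,
#which is passed to removeWords in the cleaning step below
plist <- readLines("./data/profanity_list.txt", encoding = "UTF-8")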

The commands used to clean the corpus are shown below:

allsample.corpus <- tm_map(allsample.corpus, content_transformer(tolower))
allsample.corpus <- tm_map(allsample.corpus,removePunctuation)
allsample.corpus <- tm_map(allsample.corpus,removeNumbers)
allsample.corpus <- tm_map(allsample.corpus, removeWords,plist)
allsample.corpus <- tm_map(allsample.corpus,stripWhitespace)
allsample.corpus <- tm_map(allsample.corpus,PlainTextDocument)

n-gram Tokenization

Once the corpus has been cleaned and processed, we create term-document matrices (TDMs). A TDM is a matrix (or table) that maps each term (arranged as rows) against the documents in which it occurs (arranged as columns); a document here is a single line of the dataset (i.e. a tweet, a blog post or a news article). The terms themselves are built by n-gram tokenization, where an n-gram is a contiguous sequence of n words from the text (the corpus) and "n" reflects how many consecutive words we consider when predicting the next word. A one-word gram is known as a unigram; a unigram TDM entry might be the occurrences of the word "book" across the documents. A two-word gram is a bigram, for example occurrences of the term "read book". A three-word gram is a trigram, for example "to read book", and so on for fourgrams, fivegrams and higher-order grams. Counting how often these n-grams occur provides the statistics needed to predict the next word from the preceding one, two or more words.

For the purpose of this project, we look at up to fivegram TDMs.

Commands to define the n-gram tokenizers:

library(RWeka)
u_token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_token <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_token <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_token <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
five_token <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))

Creation of TermDocumentMatrix:

tdm1 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=u_token))
tdm2 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=bi_token))
tdm3 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=tri_token))
tdm4 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=four_token))
tdm5 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=five_token))

Sparsing and Further Filtering

It is possible that many of the entries in the TDMs are zeros or ones, i.e. words or terms that are very rare. We can reduce the size of the TDMs by removing these entries. Using inspect(), all five TDMs are found to have 100% sparsity. The unigram and bigram TDMs can be reduced using removeSparseTerms(), but for the trigram, fourgram and fivegram TDMs the non-sparse/sparse ratio is so small that the same operation leaves a TDM with zero entries. For these three TDMs, terms are instead filtered by a minimum frequency of occurrence: the trigram TDM is filtered at a frequency of at least 10, the fourgram at least 5 and the fivegram at least 3.

Sparsing and filtering are also needed because of the memory limitations of the computer: converting the raw TDMs to matrices generates very large objects. Sparsing or filtering reduces their size for plotting purposes.

#check sparsity
#inspect(tdm1)
#inspect(tdm2)
#inspect(tdm3)
#inspect(tdm4)
#inspect(tdm5)
#remove sparse terms
tdm1a <- removeSparseTerms(tdm1,0.99)
tdm2a <- removeSparseTerms(tdm2,0.99)

#convert to matrix
tdm1matrix <- as.matrix(tdm1a)
tdm2matrix <- as.matrix(tdm2a)

#get terms, frequency and convert to data frame
tdm1freq <- sort(rowSums(tdm1matrix),decreasing=TRUE)
tdm1df <- data.frame(term=names(tdm1freq),frequency =tdm1freq)
tdm2freq <- sort(rowSums(tdm2matrix),decreasing=TRUE)
tdm2df <- data.frame(term=names(tdm2freq),frequency =tdm2freq)

#for trigram, fourgram and fivegram the full matrix is too large, so select a subset of frequent terms first, then convert to a data frame
tdm3a <- findFreqTerms(tdm3, lowfreq = 10)
tdm3matrix <- as.matrix(tdm3[tdm3a,])
tdm3freq <- sort(rowSums(tdm3matrix), decreasing = TRUE)
tdm3df <- data.frame(term=names(tdm3freq), frequency=tdm3freq)
tdm4a <- findFreqTerms(tdm4, lowfreq = 5)
tdm4matrix <- as.matrix(tdm4[tdm4a,])
tdm4freq <- sort(rowSums(tdm4matrix), decreasing = TRUE)
tdm4df <- data.frame(term=names(tdm4freq), frequency=tdm4freq)
tdm5a <- findFreqTerms(tdm5, lowfreq = 3)
tdm5matrix <- as.matrix(tdm5[tdm5a,])
tdm5freq <- sort(rowSums(tdm5matrix), decreasing = TRUE)
tdm5df <- data.frame(term=names(tdm5freq), frequency=tdm5freq)

Sample Data

Below are sample data from the TDMs, showing the top 10 terms for each n-gram:

Unigram

| term | frequency |
|------|-----------|
| the  | 47831     |
| and  | 24304     |
| for  | 11067     |
| that | 10525     |
| you  | 9332      |
| with | 7301      |
| was  | 6454      |
| this | 5539      |
| have | 5271      |
| are  | 5023      |

Bigram

knitr::kable(head(tdm2df,10)) 
| term     | frequency |
|----------|-----------|
| of the   | 4389      |
| in the   | 4211      |
| to the   | 2294      |
| for the  | 2041      |
| on the   | 1954      |
| to be    | 1584      |
| at the   | 1409      |
| and the  | 1262      |
| in a     | 1239      |
| with the | 1109      |

Trigram

knitr::kable(head(tdm3df,10)) 
| term           | frequency |
|----------------|-----------|
| one of the     | 394       |
| a lot of       | 275       |
| thanks for the | 243       |
| out of the     | 174       |
| i want to      | 162       |
| to be a        | 158       |
| going to be    | 156       |
| the end of     | 151       |
| as well as     | 147       |
| it was a       | 146       |

Fourgram

knitr::kable(head(tdm4df,10)) 
| term                  | frequency |
|-----------------------|-----------|
| at the end of         | 84        |
| the end of the        | 83        |
| the rest of the       | 72        |
| for the first time    | 66        |
| thanks for the follow | 59        |
| is going to be        | 51        |
| at the same time      | 49        |
| one of the most       | 44        |
| when it comes to      | 42        |
| is one of the         | 41        |

Fivegram

knitr::kable(head(tdm5df,10)) 
| term                          | frequency |
|-------------------------------|-----------|
| at the end of the             | 43        |
| the north dakota township map | 23        |
| for the first time in         | 19        |
| in the middle of the          | 17        |
| happy mothers day to all      | 13        |
| thank you so much for         | 13        |
| by the end of the             | 10        |
| i cant wait to see            | 10        |
| to be a part of               | 10        |
| for the rest of the           | 9         |

#clean memory
rm(tdm1a,tdm1freq,tdm1matrix,tdm2a,tdm2freq,tdm2matrix,tdm3a,tdm3freq,tdm3matrix,tdm4a,tdm4freq,tdm4matrix,tdm5a,tdm5freq,tdm5matrix)
rm(tdm1,tdm2,tdm3,tdm4,tdm5)

Words Frequency Visualization

Wordcloud plots are created to show the top 50 terms for each of the TDMs, and are compared with the top 50 terms of the corpus itself.

Wordcloud for unigram

wordcloud(tdm1df$term,tdm1df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Wordcloud for bigram

wordcloud(tdm2df$term,tdm2df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Wordcloud for trigram

wordcloud(tdm3df$term,tdm3df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Wordcloud for fourgram

wordcloud(tdm4df$term,tdm4df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Wordcloud for fivegram

wordcloud(tdm5df$term,tdm5df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Wordcloud for Corpus

wordcloud(allsample.corpus,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))

Histogram

We further explore the trend of word/term occurrences using histograms. The plots below display the top 30 terms for each of the TDMs.

library(ggplot2)
g1 <- ggplot(head(tdm1df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
  geom_bar(stat = "identity") +  coord_flip() +  theme_gray() + 
  theme(legend.title=element_blank()) +
  xlab("Unigram") + ylab("Frequency") +
  labs(title = "Top 30 Unigrams by Frequency")
print(g1)

In the unigram plot above, common English words dominate the list, as expected. This is because stopwords were not removed from the corpus.

g2 <- ggplot(head(tdm2df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
  geom_bar(stat = "identity") +  coord_flip() +  theme_gray() + 
  theme(legend.title=element_blank()) +
  xlab("Bigram") + ylab("Frequency") +
  labs(title = "Top 30 Bigrams by Frequency")
print(g2)

g3 <- ggplot(head(tdm3df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
  geom_bar(stat = "identity") +  coord_flip() + theme_gray() + 
  theme(legend.title=element_blank()) +
  xlab("Trigram") + ylab("Frequency") + 
  labs(title = "Top 30 Trigrams by Frequency")
print(g3)

g4 <- ggplot(head(tdm4df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
  geom_bar(stat = "identity") +  coord_flip() + theme_gray() +
  theme(legend.title=element_blank()) +
  xlab("Fourgram") + ylab("Frequency") +
  labs(title = "Top 30 Fourgrams by Frequency")
print(g4)

g5 <- ggplot(head(tdm5df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
  geom_bar(stat = "identity") +  coord_flip() +  theme_gray() + 
  theme(legend.title=element_blank()) +
  xlab("Fivegram") + ylab("Frequency") +
  labs(title = "Top 30 Fivegrams by Frequency")
print(g5)

Way Forward

The next step is to create a prediction model based on the n-gram tokenizations. We need to decide whether to use a 2-gram, 3-gram or higher-order n-gram model. For the purpose of this project, a 2- or 3-gram model is proposed.

We also need to consider how to handle text input that does not match any of the n-grams. This is highly likely, as a user may enter a sequence of words that does not appear in the corpus. Handling this may involve smoothing and backoff models.
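
As an illustration of the backoff idea, below is a minimal sketch using the n-gram frequency tables built above (tdm3df, tdm2df, tdm1df). It is not the final model and ignores smoothing and input sanitization:

predict_next <- function(input) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  #try the trigram table: match the last two words of the input
  if (n >= 2) {
    hits <- tdm3df[grepl(paste0("^", words[n - 1], " ", words[n], " "), tdm3df$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  #back off to the bigram table: match the last word only
  if (n >= 1) {
    hits <- tdm2df[grepl(paste0("^", words[n], " "), tdm2df$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$term[1])))
  }
  #final fallback: the most frequent unigram
  as.character(tdm1df$term[1])
}

predict_next("thanks for")   # likely returns "the"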

We also need to consider how to improve the efficiency and accuracy of the model given the limitations of device memory (RAM) and processing time, especially on mobile devices.

Ultimately, the application needs to use a model that is small yet reasonably accurate and fast, as a trade-off imposed by computing hardware limitations.

References

  1. Data source from SwiftKey - download zip file
  2. Dataset Corpus (HC Corpora) - Read me file
  3. Source codes - Github site
  4. Blacklisted words - Shutterstock Github Site (raw)