This document presents an exploratory analysis of the datasets used to build a text prediction application, which will suggest the next word based on one to three words entered by the user. It is written as a brief report that highlights the key ideas and findings for managers who do not come from a data science background. The prediction model is based on natural language processing (NLP) and text mining techniques.
Data for this project is obtained from SwiftKey (see Reference below for the link) and contains three types of textual data, from Twitter, blogs and news, each available in four languages. For this project, only the English-language data is used. The dataset is sampled and the sample is processed into a collection of text known as a corpus. The corpus is cleaned and then manipulated using NLP techniques, including tokenization and calculation of word-occurrence frequencies. Finally, a model is proposed for text prediction.
For simplicity, most of the code used to generate this document is not displayed. However, it can be accessed on GitHub for verification purposes.
# read the three English datasets (UTF-8 encoded)
datatwitter <- readLines(conn <- file("./data/en_US.twitter.txt", encoding = "UTF-8"))
close(conn)
datablog <- readLines(conn <- file("./data/en_US.blogs.txt", encoding = "UTF-8"))
close(conn)
# the news file is opened in binary mode so reading does not stop at embedded special characters
datanews <- readLines(conn <- file("./data/en_US.news.txt", "rb", encoding = "UTF-8"))
close(conn)
The file size, object size (in memory) and number of lines of each dataset are quite large, as shown below. The average number of words per line also reflects the nature of each source. Blogs, despite having the fewest lines, contain the most words, since blog posts are not subject to the length limits of tweets or news articles. Twitter, with its limit of 140 characters per tweet, naturally has the lowest average of the three.
When combined, the datasets contain over 4 million lines and more than 102 million words. The overall average of about 24 words per line is pulled down by the Twitter dataset, which has the most lines but the fewest words per line.
Due to the large data size and computational limitations, the number of unique words could not be obtained for the full dataset. It will, however, be calculated for the sample dataset later.
| Dataset | File Size (MB) | Object Size | Lines | Number of Words | Mean Number of Words |
|---|---|---|---|---|---|
| Twitter | 163.189 | 301.4 Mb | 2,360,148 | 30,093,369 | 12.75063 |
| Blogs | 205.234 | 248.5 Mb | 899,288 | 37,546,246 | 41.75108 |
| News | 200.988 | 249.6 Mb | 1,010,242 | 35,010,782 | 34.65584 |
| Combined | NA | 799.4 Mb | 4,269,678 | 102,650,397 | 24.04172 |
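As an illustration of how the summary figures above can be produced, the sketch below uses the stringi package for word counting; the helper function name is an assumption and this is not necessarily the exact code used to build the table.
library(stringi)
# hypothetical helper: summarise one dataset (illustrative, assumes stringi is installed)
summarise_dataset <- function(x, path) {
  words <- stri_count_words(x)                 # words per line
  data.frame(
    file_size_mb = file.size(path) / 1024^2,   # size of the file on disk
    object_size  = format(object.size(x), units = "Mb"),
    lines        = length(x),
    words        = sum(words),
    mean_words   = mean(words)
  )
}
summarise_dataset(datatwitter, "./data/en_US.twitter.txt")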
For the purpose of this project, data cleansing will be done after creating a corpus (a collection of text documents) from the sample. Removing empty lines (or null values, NAs) or replacing certain values (such as zeros or other integers) is not necessary at this stage, as it does not affect the prediction model.
Due to computational resource limitations, a sample of 1% of the data is used. The sample is drawn by taking 1% from each dataset and then combining them, rather than sampling directly from the combined set, so that the three sources remain represented in their original proportions. A seed value is set to ensure reproducibility.
The sample data is then converted to ASCII so that any unreadable characters are removed.
#set seed for reproducibility
set.seed(1234)
#create sample 1%
sampletwt <- sample(datatwitter,round(0.01*length(datatwitter)))
sampleblog <- sample(datablog,round(0.01*length(datablog)))
samplenews <- sample(datanews,round(0.01*length(datanews)))
allsample <- c(sampletwt,sampleblog,samplenews)
# convert to ASCII so that unreadable or garbled characters are removed
sampletwt <- iconv(sampletwt, 'UTF-8', 'ASCII', "byte")
sampleblog <- iconv(sampleblog, 'UTF-8', 'ASCII', "byte")
samplenews <- iconv(samplenews, 'UTF-8', 'ASCII', "byte")
allsample <- iconv(allsample, 'UTF-8', 'ASCII', "byte")
A simple exploration of the samples, similar to that of the full datasets, is carried out. The mean number of words per line in each sample is approximately the same as in the corresponding full dataset: Twitter (12.94759 vs 12.75063), blogs (44.99444 vs 41.75108), news (35.73461 vs 34.65584) and the combined sample (25.08905 vs 24.04172).
In terms of unique words, each of the three individual samples has a higher percentage of unique words than the combined sample. This is expected, as combining the datasets introduces a significant number of duplicate words.
| Dataset | Object Size | Lines | Number of Words | Mean Number of Words | Unique Words | % Unique Words |
|---|---|---|---|---|---|---|
| Twitter | 3.1 Mb | 23,601 | 305,576 | 12.94759 | 32,889 | 10.76295 |
| Blogs | 2.6 Mb | 8,993 | 404,635 | 44.99444 | 34,209 | 8.454286 |
| News | 2.5 Mb | 10,102 | 360,991 | 35.73461 | 34,897 | 9.667 |
| Combined | 8.2 Mb | 42,696 | 1,071,202 | 25.08905 | 67,457 | 6.297318 |
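For reference, the unique-word counts in the table above can be approximated as sketched below; this assumes a simple whitespace split, which may differ slightly from the tokenization actually used.
# rough count of unique words in the combined sample (simple whitespace tokenization)
tokens <- unlist(strsplit(tolower(allsample), "\\s+"))
tokens <- tokens[tokens != ""]                       # drop empty strings
unique_words <- length(unique(tokens))
pct_unique <- 100 * unique_words / length(tokens)    # percentage of unique words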
Word clouds of the individual sample datasets and the combined sample are created, displaying the top 50 words in each; a sketch of how such a plot can be generated is shown below.
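A minimal sketch, assuming the wordcloud and RColorBrewer packages; the same call can be repeated for sampleblog, samplenews and allsample.
library(wordcloud)
library(RColorBrewer)
# word cloud of the top 50 words in the Twitter sample
wordcloud(sampletwt, max.words = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))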
The sample dataset is converted to a corpus using the tm package. A corpus is simply a collection of text documents, and this conversion enables further analysis of the data.
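The conversion step itself is not echoed in this report; a minimal sketch, assuming the tm package, would look like this:
library(tm)
# build a corpus in which each line of the combined sample becomes one document
allsample.corpus <- VCorpus(VectorSource(allsample))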
The following steps are done in cleaning up the corpus:
| Order | Step |
|---|---|
| 1 | Changing all words to lower case |
| 2 | Removing punctuations |
| 3 | Removing numbers |
| 4 | Removing profanities |
| 5 | Removing extra whitespaces |
No stemming (reducing words to their root form, e.g. buyer -> buy, dogs -> dog) is done, as it may affect word prediction. Similarly, stopwords (words that are very common in a given language, such as "the", "a", "is", etc.) are not removed, since they occur frequently in normal text and removing them would affect the prediction model.
The list of profanities is retrieved from a GitHub repository (listed in Reference).
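Loading the list is not shown in the report; a minimal sketch, assuming it has already been downloaded to a local text file (the path below is a placeholder), would be:
# load the profanity list, one word per line (placeholder path)
plist <- readLines("./data/profanity_list.txt")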
The commands used to clean the corpus are shown below:
allsample.corpus <- tm_map(allsample.corpus, content_transformer(tolower))
allsample.corpus <- tm_map(allsample.corpus,removePunctuation)
allsample.corpus <- tm_map(allsample.corpus,removeNumbers)
allsample.corpus <- tm_map(allsample.corpus, removeWords,plist)
allsample.corpus <- tm_map(allsample.corpus,stripWhitespace)
allsample.corpus <- tm_map(allsample.corpus,PlainTextDocument)
Once the corpus has been cleaned and processed, we create term-document matrices (TDMs). A TDM is a matrix (or table) that maps each term (arranged as rows) against the documents in which it occurs (arranged as columns). A document here refers to one line of the dataset (i.e. a tweet, a blog post or a news article). The terms are produced by n-gram tokenization, where "n" is the number of consecutive words we want to consider when predicting the next word; "n" can be 1, 2, 3 and so on. An n-gram is a contiguous sequence of n items from a given sequence of text or speech (here, the corpus). A single-word term is known as a unigram; a unigram TDM entry might record the occurrences of the word "book" across the documents. A two-word term is a bigram, for example the occurrences of the term "read book". A three-word term is a trigram, for example "to read book", and so on for fourgrams, fivegrams and higher-order grams.
For the purpose of this project, we look at up to fivegram TDMs.
Commands to define the n-gram tokenizers:
library(RWeka)
u_token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_token <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_token <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_token <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
five_token <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
Creation of TermDocumentMatrix:
tdm1 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=u_token))
tdm2 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=bi_token))
tdm3 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=tri_token))
tdm4 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=four_token))
tdm5 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=five_token))
It is likely that many of the terms in the TDMs occur only rarely, so that most matrix entries are zero or one. We can reduce the size of the TDMs by removing these entries. Using inspect(), all five TDMs are found to have close to 100% sparsity. The unigram and bigram TDMs can be reduced with removeSparseTerms(), but for the trigram, fourgram and fivegram TDMs the non-sparse/sparse ratio is so small that the same call leaves a TDM with zero entries. For these three TDMs, terms are instead kept only if they occur with at least a minimum frequency: trigrams with a frequency of at least 10, fourgrams of at least 5 and fivegrams of at least 3.
Sparsity reduction and filtering are also necessary because of memory limitations: converting the raw TDMs to matrices generates very large objects. Reducing sparsity or filtering keeps the objects small enough for plotting.
#check sparsity
#inspect(tdm1)
#inspect(tdm2)
#inspect(tdm3)
#inspect(tdm4)
#inspect(tdm5)
#remove sparse terms
tdm1a <- removeSparseTerms(tdm1,0.99)
tdm2a <- removeSparseTerms(tdm2,0.99)
#convert to matrix
tdm1matrix <- as.matrix(tdm1a)
tdm2matrix <- as.matrix(tdm2a)
#get terms, frequency and convert to data frame
tdm1freq <- sort(rowSums(tdm1matrix),decreasing=TRUE)
tdm1df <- data.frame(term=names(tdm1freq),frequency =tdm1freq)
tdm2freq <- sort(rowSums(tdm2matrix),decreasing=TRUE)
tdm2df <- data.frame(term=names(tdm2freq),frequency =tdm2freq)
# for the trigram, fourgram and fivegram TDMs the full matrix is too big, so select frequent terms first and then convert to a data frame
tdm3a <- findFreqTerms(tdm3, lowfreq = 10)
tdm3matrix <- as.matrix(tdm3[tdm3a,])
tdm3freq <- sort(rowSums(tdm3matrix), decreasing = TRUE)
tdm3df <- data.frame(term=names(tdm3freq), frequency=tdm3freq)
tdm4a <- findFreqTerms(tdm4, lowfreq = 5)
tdm4matrix <- as.matrix(tdm4[tdm4a,])
tdm4freq <- sort(rowSums(tdm4matrix), decreasing = TRUE)
tdm4df <- data.frame(term=names(tdm4freq), frequency=tdm4freq)
tdm5a <- findFreqTerms(tdm5, lowfreq = 3)
tdm5matrix <- as.matrix(tdm5[tdm5a,])
tdm5freq <- sort(rowSums(tdm5matrix), decreasing = TRUE)
tdm5df <- data.frame(term=names(tdm5freq), frequency=tdm5freq)
Below are samples from the TDM frequency tables, showing the top 10 terms for each n-gram:
| term | frequency |
|---|---|
| the | 47831 |
| and | 24304 |
| for | 11067 |
| that | 10525 |
| you | 9332 |
| with | 7301 |
| was | 6454 |
| this | 5539 |
| have | 5271 |
| are | 5023 |
knitr::kable(head(tdm2df,10))
| term | frequency |
|---|---|
| of the | 4389 |
| in the | 4211 |
| to the | 2294 |
| for the | 2041 |
| on the | 1954 |
| to be | 1584 |
| at the | 1409 |
| and the | 1262 |
| in a | 1239 |
| with the | 1109 |
knitr::kable(head(tdm3df,10))
| term | frequency |
|---|---|
| one of the | 394 |
| a lot of | 275 |
| thanks for the | 243 |
| out of the | 174 |
| i want to | 162 |
| to be a | 158 |
| going to be | 156 |
| the end of | 151 |
| as well as | 147 |
| it was a | 146 |
knitr::kable(head(tdm4df,10))
| term | frequency |
|---|---|
| at the end of | 84 |
| the end of the | 83 |
| the rest of the | 72 |
| for the first time | 66 |
| thanks for the follow | 59 |
| is going to be | 51 |
| at the same time | 49 |
| one of the most | 44 |
| when it comes to | 42 |
| is one of the | 41 |
knitr::kable(head(tdm5df,10))
| term | frequency |
|---|---|
| at the end of the | 43 |
| the north dakota township map | 23 |
| for the first time in | 19 |
| in the middle of the | 17 |
| happy mothers day to all | 13 |
| thank you so much for | 13 |
| by the end of the | 10 |
| i cant wait to see | 10 |
| to be a part of | 10 |
| for the rest of the | 9 |
#clean memory
rm(tdm1a,tdm1freq,tdm1matrix,tdm2a,tdm2freq,tdm2matrix,tdm3a,tdm3freq,tdm3matrix,tdm4a,tdm4freq,tdm4matrix,tdm5a,tdm5freq,tdm5matrix)
rm(tdm1,tdm2,tdm3,tdm4,tdm5)
Word cloud plots are created to show the top 50 terms in each TDM, compared with the top 50 terms in the corpus.
wordcloud(tdm1df$term,tdm1df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm2df$term,tdm2df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm3df$term,tdm3df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm4df$term,tdm4df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm5df$term,tdm5df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(allsample.corpus,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
We further explore the word/term occurrences using frequency plots. The charts below display the top 30 terms for each of the TDMs.
library(ggplot2)
g1 <- ggplot(head(tdm1df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
labs(title = "Top 30 Unigrams by Frequency")
print(g1)
In the unigram plot above, common English words dominate the list, as expected, because stopwords were not removed from the corpus.
g2 <- ggplot(head(tdm2df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
labs(title = "Top 30 Bigrams by Frequency")
print(g2)
g3 <- ggplot(head(tdm3df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
labs(title = "Top 30 Trigrams by Frequency")
print(g3)
g4 <- ggplot(head(tdm4df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Fourgram") + ylab("Frequency") +
labs(title = "Top 30 Fourgrams by Frequency")
print(g4)
g5 <- ggplot(head(tdm5df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Fivegram") + ylab("Frequency") +
labs(title = "Top 30 Fivegrams by Frequency")
print(g5)
The next step is to create a prediction model based on the n-gram tokenizations. We need to decide whether to use a 2-gram, 3-gram or higher-order n-gram model. For the purpose of this project, a 2- or 3-gram model is proposed.
We also need to consider how to handle text input that does not match any of the n-grams. This is highly likely, as a user may enter a sequence of words that does not appear in the corpus. Handling this may involve smoothing and backoff models; a simple backoff lookup is sketched below.
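As an illustration only, a very simple backoff lookup over the frequency tables built above might look like the sketch below; the function name and matching logic are assumptions, not the final model.
# hypothetical next-word lookup with simple backoff (illustrative only)
predict_next <- function(input, tri = tdm3df, bi = tdm2df, uni = tdm1df) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  # try the trigram table using the last two words of the input
  if (n >= 2) {
    hits <- tri[grepl(paste0("^", words[n - 1], " ", words[n], " "), tri$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$term[1]))
  }
  # back off to the bigram table using the last word only
  hits <- bi[grepl(paste0("^", words[n], " "), bi$term), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$term[1]))
  # final fallback: the most frequent unigram
  as.character(uni$term[1])
}
predict_next("one of")   # should return "the" given the trigram table above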
We also need to consider how to improve the efficiency and accuracy of the model, given the limitations of device memory (RAM) and processing time, especially on mobile devices.
In the end, the application needs a model that is small, reasonably accurate and fast, as a trade-off imposed by computing hardware limitations; one possible way of keeping the model small is sketched below.
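For example, one option (our assumption, not a decision documented in this report) is to prune each n-gram table to its most frequent entries and store the result in a compact serialized form:
# prune each n-gram table to its top entries and save compactly (illustrative)
keep_top <- function(df, n = 50000) head(df[order(-df$frequency), ], n)
model <- list(unigram = keep_top(tdm1df), bigram = keep_top(tdm2df), trigram = keep_top(tdm3df))
saveRDS(model, "ngram_model.rds", compress = "xz")   # compact file for the application to load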