This document presents an exploratory analysis of the datasets used to build a text prediction application, which will suggest the next word based on one to three words entered by the user. It is written as a brief report that highlights the key ideas and findings for managers who do not come from a data science background. The prediction model is based on natural language processing (NLP) and text mining techniques.
Data for this project is obtained from SwiftKey (see Reference below for the link) and contains three types of textual data, from Twitter, blogs and news, each available in four languages. For this project, only the English-language data is used. The dataset is sampled and the sample is processed into a collection of text known as a corpus. The corpus is cleaned and then manipulated using NLP techniques, including tokenization and calculation of word-occurrence frequencies. Finally, a model is proposed for text prediction.
For simplicity, most of the code used to generate this document is not displayed. However, it can be accessed on GitHub for verification purposes.
# read the three English datasets (UTF-8 encoded)
datatwitter <- readLines(conn <- file("./data/en_US.twitter.txt", encoding = "UTF-8"))
close(conn)
datablog <- readLines(conn <- file("./data/en_US.blogs.txt", encoding = "UTF-8"))
close(conn)
# the news file is opened in binary mode so reading does not stop at embedded special characters
datanews <- readLines(conn <- file("./data/en_US.news.txt", "rb", encoding = "UTF-8"))
close(conn)
The file size, object size (in memory) and number of lines of each dataset are quite large, as shown below. The average number of words per line also reflects the nature of each source. Blogs, despite having the fewest lines, contain the most words, since blog posts are not subject to the length limits of tweets or news articles. Twitter, with its limit of 140 characters per tweet, naturally has the lowest average of the three.
When combined, the datasets contain over 4 million lines and more than 102 million words. The overall average of about 24 words per line is pulled down by the Twitter dataset, which has the most lines but the fewest words per line.
Due to the large data size and computational limitations, the number of unique words could not be obtained for the full dataset. It will, however, be calculated for the sample dataset later.
| Dataset | File Size (MB) | Object Size | Lines | Number of Words | Mean Number of Words |
|---|---|---|---|---|---|
| Twitter | 163.189 | 301.4 Mb | 2,360,148 | 30,093,369 | 12.75063 |
| Blogs | 205.234 | 248.5 Mb | 899,288 | 37,546,246 | 41.75108 |
| News | 200.988 | 249.6 Mb | 1,010,242 | 35,010,782 | 34.65584 |
| Combined | NA | 799.4 Mb | 4,269,678 | 102,650,397 | 24.04172 |
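As an illustration of how the summary figures above can be produced, the sketch below uses the stringi package for word counting; the helper function name is an assumption and this is not necessarily the exact code used to build the table.
library(stringi)
# hypothetical helper: summarise one dataset (illustrative, assumes stringi is installed)
summarise_dataset <- function(x, path) {
  words <- stri_count_words(x)                 # words per line
  data.frame(
    file_size_mb = file.size(path) / 1024^2,   # size of the file on disk
    object_size  = format(object.size(x), units = "Mb"),
    lines        = length(x),
    words        = sum(words),
    mean_words   = mean(words)
  )
}
summarise_dataset(datatwitter, "./data/en_US.twitter.txt")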
For the purpose of this project, data cleansing will be done after creating a corpus (a collection of text documents) from the sample. Removing empty lines (or null values, NAs) or replacing certain values (such as zeros or other integers) is not necessary at this stage, as it does not affect the prediction model.
Due to computational resource limitations, a sample of 1% of the data is used. The sample is drawn by taking 1% from each dataset and then combining them, rather than sampling directly from the combined set, so that the three sources remain represented in their original proportions. A seed value is set to ensure reproducibility.
The sample data is then converted to ASCII so that any unreadable characters are removed.
#set seed for reproducibility
set.seed(1234)
#create sample 1%
sampletwt <- sample(datatwitter,round(0.01*length(datatwitter)))
sampleblog <- sample(datablog,round(0.01*length(datablog)))
samplenews <- sample(datanews,round(0.01*length(datanews)))
allsample <- c(sampletwt,sampleblog,samplenews)
# convert to ASCII so that unreadable or garbled characters are removed
sampletwt <- iconv(sampletwt, 'UTF-8', 'ASCII', "byte")
sampleblog <- iconv(sampleblog, 'UTF-8', 'ASCII', "byte")
samplenews <- iconv(samplenews, 'UTF-8', 'ASCII', "byte")
allsample <- iconv(allsample, 'UTF-8', 'ASCII', "byte")
A simple exploration of the samples, similar to that of the full datasets, is carried out. The mean number of words per line in each sample is approximately the same as in the corresponding full dataset: Twitter (12.94759 vs 12.75063), blogs (44.99444 vs 41.75108), news (35.73461 vs 34.65584) and the combined sample (25.08905 vs 24.04172).
In terms of unique words, each of the three individual samples has a higher percentage of unique words than the combined sample. This is expected, as combining the datasets introduces a significant number of duplicate words.
| Dataset | Object Size | Lines | Number of Words | Mean Number of Words | Unique Words | % Unique Words |
|---|---|---|---|---|---|---|
| Twitter | 3.1 Mb | 23,601 | 305,576 | 12.94759 | 32,889 | 10.76295 |
| Blogs | 2.6 Mb | 8,993 | 404,635 | 44.99444 | 34,209 | 8.454286 |
| News | 2.5 Mb | 10,102 | 360,991 | 35.73461 | 34,897 | 9.667 |
| Combined | 8.2 Mb | 42,696 | 1,071,202 | 25.08905 | 67,457 | 6.297318 |
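For reference, the unique-word counts in the table above can be approximated as sketched below; this assumes a simple whitespace split, which may differ slightly from the tokenization actually used.
# rough count of unique words in the combined sample (simple whitespace tokenization)
tokens <- unlist(strsplit(tolower(allsample), "\\s+"))
tokens <- tokens[tokens != ""]                       # drop empty strings
unique_words <- length(unique(tokens))
pct_unique <- 100 * unique_words / length(tokens)    # percentage of unique words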
Word clouds of the individual sample datasets and the combined sample are created, displaying the top 50 words in each; a sketch of how such a plot can be generated is shown below.
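A minimal sketch, assuming the wordcloud and RColorBrewer packages; the same call can be repeated for sampleblog, samplenews and allsample.
library(wordcloud)
library(RColorBrewer)
# word cloud of the top 50 words in the Twitter sample
wordcloud(sampletwt, max.words = 50, random.order = FALSE, colors = brewer.pal(8, "Dark2"))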
The sample dataset is converted to a corpus using the tm package. A corpus is simply a collection of text documents, and this conversion enables further analysis of the data.
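The conversion step itself is not echoed in this report; a minimal sketch, assuming the tm package, would look like this:
library(tm)
# build a corpus in which each line of the combined sample becomes one document
allsample.corpus <- VCorpus(VectorSource(allsample))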
The following steps are done in cleaning up the corpus:
| Order | Step |
|---|---|
| 1 | Changing all words to lower case |
| 2 | Removing punctuations |
| 3 | Removing numbers |
| 4 | Removing profanities |
| 5 | Removing extra whitespaces |
No stemming (reducing words to their root form, e.g. buyer -> buy, dogs -> dog) is done, as it may affect word prediction. Similarly, stopwords (words that are very common in a given language, such as "the", "a", "is", etc.) are not removed, since they occur frequently in normal text and removing them would affect the prediction model.
The list of profanities is retrieved from a GitHub repository (listed in Reference).
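Loading the list is not shown in the report; a minimal sketch, assuming it has already been downloaded to a local text file (the path below is a placeholder), would be:
# load the profanity list, one word per line (placeholder path)
plist <- readLines("./data/profanity_list.txt")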
The commands used to clean the corpus are shown below:
allsample.corpus <- tm_map(allsample.corpus, content_transformer(tolower))
allsample.corpus <- tm_map(allsample.corpus,removePunctuation)
allsample.corpus <- tm_map(allsample.corpus,removeNumbers)
allsample.corpus <- tm_map(allsample.corpus, removeWords,plist)
allsample.corpus <- tm_map(allsample.corpus,stripWhitespace)
allsample.corpus <- tm_map(allsample.corpus,PlainTextDocument)
Once the corpus has been cleaned and processed, we create term-document matrices (TDMs). A TDM is a matrix (or table) that maps each term (arranged as rows) against the documents in which it occurs (arranged as columns). A document here refers to one line of the dataset (i.e. a tweet, a blog post or a news article). The terms are produced by n-gram tokenization, where "n" is the number of consecutive words we want to consider when predicting the next word; "n" can be 1, 2, 3 and so on. An n-gram is a contiguous sequence of n items from a given sequence of text or speech (here, the corpus). A single-word term is known as a unigram; a unigram TDM entry might record the occurrences of the word "book" across the documents. A two-word term is a bigram, for example the occurrences of the term "read book". A three-word term is a trigram, for example "to read book", and so on for fourgrams, fivegrams and higher-order grams.
For the purpose of this project, we look at up to fivegram TDMs.
Commands to define the n-gram tokenizers:
library(RWeka)
u_token <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_token <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_token <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
four_token <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
five_token <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
Creation of TermDocumentMatrix:
tdm1 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=u_token))
tdm2 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=bi_token))
tdm3 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=tri_token))
tdm4 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=four_token))
tdm5 <- TermDocumentMatrix(allsample.corpus, control=list(tokenize=five_token))
It is likely that many of the terms in the TDMs occur only rarely, so that most matrix entries are zero or one. We can reduce the size of the TDMs by removing these entries. Using inspect(), all five TDMs are found to have close to 100% sparsity. The unigram and bigram TDMs can be reduced with removeSparseTerms(), but for the trigram, fourgram and fivegram TDMs the non-sparse/sparse ratio is so small that the same call leaves a TDM with zero entries. For these three TDMs, terms are instead kept only if they occur with at least a minimum frequency: trigrams with a frequency of at least 10, fourgrams of at least 5 and fivegrams of at least 3.
Sparsity reduction and filtering are also necessary because of memory limitations: converting the raw TDMs to matrices generates very large objects. Reducing sparsity or filtering keeps the objects small enough for plotting.
#check sparsity
#inspect(tdm1)
#inspect(tdm2)
#inspect(tdm3)
#inspect(tdm4)
#inspect(tdm5)
#remove sparse terms
tdm1a <- removeSparseTerms(tdm1,0.99)
tdm2a <- removeSparseTerms(tdm2,0.99)
#convert to matrix
tdm1matrix <- as.matrix(tdm1a)
tdm2matrix <- as.matrix(tdm2a)
#get terms, frequency and convert to data frame
tdm1freq <- sort(rowSums(tdm1matrix),decreasing=TRUE)
tdm1df <- data.frame(term=names(tdm1freq),frequency =tdm1freq)
tdm2freq <- sort(rowSums(tdm2matrix),decreasing=TRUE)
tdm2df <- data.frame(term=names(tdm2freq),frequency =tdm2freq)
# for the trigram, fourgram and fivegram TDMs the full matrix is too big, so select frequent terms first and then convert to a data frame
tdm3a <- findFreqTerms(tdm3, lowfreq = 10)
tdm3matrix <- as.matrix(tdm3[tdm3a,])
tdm3freq <- sort(rowSums(tdm3matrix), decreasing = TRUE)
tdm3df <- data.frame(term=names(tdm3freq), frequency=tdm3freq)
tdm4a <- findFreqTerms(tdm4, lowfreq = 5)
tdm4matrix <- as.matrix(tdm4[tdm4a,])
tdm4freq <- sort(rowSums(tdm4matrix), decreasing = TRUE)
tdm4df <- data.frame(term=names(tdm4freq), frequency=tdm4freq)
tdm5a <- findFreqTerms(tdm5, lowfreq = 3)
tdm5matrix <- as.matrix(tdm5[tdm5a,])
tdm5freq <- sort(rowSums(tdm5matrix), decreasing = TRUE)
tdm5df <- data.frame(term=names(tdm5freq), frequency=tdm5freq)
Below are samples from the TDM frequency tables, showing the top 10 terms for each n-gram:
| term | frequency |
|---|---|
| the | 47831 |
| and | 24304 |
| for | 11067 |
| that | 10525 |
| you | 9332 |
| with | 7301 |
| was | 6454 |
| this | 5539 |
| have | 5271 |
| are | 5023 |
knitr::kable(head(tdm2df,10))
| term | frequency |
|---|---|
| of the | 4389 |
| in the | 4211 |
| to the | 2294 |
| for the | 2041 |
| on the | 1954 |
| to be | 1584 |
| at the | 1409 |
| and the | 1262 |
| in a | 1239 |
| with the | 1109 |
knitr::kable(head(tdm3df,10))
| term | frequency |
|---|---|
| one of the | 394 |
| a lot of | 275 |
| thanks for the | 243 |
| out of the | 174 |
| i want to | 162 |
| to be a | 158 |
| going to be | 156 |
| the end of | 151 |
| as well as | 147 |
| it was a | 146 |
knitr::kable(head(tdm4df,10))
| term | frequency |
|---|---|
| at the end of | 84 |
| the end of the | 83 |
| the rest of the | 72 |
| for the first time | 66 |
| thanks for the follow | 59 |
| is going to be | 51 |
| at the same time | 49 |
| one of the most | 44 |
| when it comes to | 42 |
| is one of the | 41 |
knitr::kable(head(tdm5df,10))
| term | frequency |
|---|---|
| at the end of the | 43 |
| the north dakota township map | 23 |
| for the first time in | 19 |
| in the middle of the | 17 |
| happy mothers day to all | 13 |
| thank you so much for | 13 |
| by the end of the | 10 |
| i cant wait to see | 10 |
| to be a part of | 10 |
| for the rest of the | 9 |
#clean memory
rm(tdm1a,tdm1freq,tdm1matrix,tdm2a,tdm2freq,tdm2matrix,tdm3a,tdm3freq,tdm3matrix,tdm4a,tdm4freq,tdm4matrix,tdm5a,tdm5freq,tdm5matrix)
rm(tdm1,tdm2,tdm3,tdm4,tdm5)
Word cloud plots are created to show the top 50 terms in each TDM, compared with the top 50 terms in the corpus.
wordcloud(tdm1df$term,tdm1df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm2df$term,tdm2df$frequency,min.freq=200,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm3df$term,tdm3df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm4df$term,tdm4df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(tdm5df$term,tdm5df$frequency,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
wordcloud(allsample.corpus,max.words = 50, random.color = TRUE, random.order = FALSE, colors = brewer.pal(12,"Set3"))
We further explore the word/term occurrences using frequency plots. The charts below display the top 30 terms for each of the TDMs.
library(ggplot2)
g1 <- ggplot(head(tdm1df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
labs(title = "Top 30 Unigrams by Frequency")
print(g1)
In the unigram plot above, common English words dominate the list, as expected, because stopwords were not removed from the corpus.
g2 <- ggplot(head(tdm2df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
labs(title = "Top 30 Bigrams by Frequency")
print(g2)
g3 <- ggplot(head(tdm3df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Trigram") + ylab("Frequency") +
labs(title = "Top 30 Trigrams by Frequency")
print(g3)
g4 <- ggplot(head(tdm4df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Fourgram") + ylab("Frequency") +
labs(title = "Top 30 Fourgrams by Frequency")
print(g4)
g5 <- ggplot(head(tdm5df,30), aes(x=reorder(term, frequency), y=frequency, fill=frequency)) +
geom_bar(stat = "identity") + coord_flip() + theme_gray() +
theme(legend.title=element_blank()) +
xlab("Fivegram") + ylab("Frequency") +
labs(title = "Top 30 Fivegrams by Frequency")
print(g5)
The next step is to create a prediction model based on the n-gram tokenizations. We need to decide whether to use a 2-gram, 3-gram or higher-order n-gram model. For the purpose of this project, a 2- or 3-gram model is proposed.
We also need to consider how to handle text input that does not match any of the n-grams. This is highly likely, as a user may enter a sequence of words that does not appear in the corpus. Handling this may involve smoothing and backoff models; a simple backoff lookup is sketched below.
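As an illustration only, a very simple backoff lookup over the frequency tables built above might look like the sketch below; the function name and matching logic are assumptions, not the final model.
# hypothetical next-word lookup with simple backoff (illustrative only)
predict_next <- function(input, tri = tdm3df, bi = tdm2df, uni = tdm1df) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  n <- length(words)
  # try the trigram table using the last two words of the input
  if (n >= 2) {
    hits <- tri[grepl(paste0("^", words[n - 1], " ", words[n], " "), tri$term), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$term[1]))
  }
  # back off to the bigram table using the last word only
  hits <- bi[grepl(paste0("^", words[n], " "), bi$term), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$term[1]))
  # final fallback: the most frequent unigram
  as.character(uni$term[1])
}
predict_next("one of")   # should return "the" given the trigram table above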
We also need to consider how to improve the efficiency and accuracy of the model, given the limitations of device memory (RAM) and processing time, especially on mobile devices.
In the end, the application needs a model that is small, reasonably accurate and fast, as a trade-off imposed by computing hardware limitations; one possible way of keeping the model small is sketched below.
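For example, one option (our assumption, not a decision documented in this report) is to prune each n-gram table to its most frequent entries and store the result in a compact serialized form:
# prune each n-gram table to its top entries and save compactly (illustrative)
keep_top <- function(df, n = 50000) head(df[order(-df$frequency), ], n)
model <- list(unigram = keep_top(tdm1df), bigram = keep_top(tdm2df), trigram = keep_top(tdm3df))
saveRDS(model, "ngram_model.rds", compress = "xz")   # compact file for the application to load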