This Capstone project is held in collaboration with SwiftKey, a well-known company specialized in Natural Language Processing algorithms and apps that are widely used on mobile devices. The main goal of this Capstone project is to create an algorithm that predicts the next possible words while a text fragment is typed into an input field, as many people know from their mobile devices. Because these devices have limited storage and RAM, it is not a good idea to rely on huge databases for predicting the next words. Instead, well-performing algorithms will be used.
This Milestone Report gives a short overview and some exploratory results about our training data set. The training data set can be obtained from this link (SwiftKey.zip).
After downloading the data from the URL given above and loading the text files into the project, we are ready for a first look at our data set.
The English texts will be used for the exploratory analysis because they are the easiest to understand when it comes to developing the Shiny app. Furthermore, a sample of 5% of every text file will be used. This is necessary to reduce the time needed for pre-processing and cleaning the data as well as for tokenizing the words of a corpus into different n-grams. Another point we have to be aware of is R's limitation to the available RAM. The function to sample the data sets looks like this:
###
# Samples the given data and returns
# a percentage amount of it.
# @data: The data which we want to be sampled (a vector)
# @percentage: How many percent of the data do we approx. need?
#              allowed values: 0..1 (i.e. 0.1 ~ 10%)
###
sampleTexts <- function(data, percentage)
{
    # number of entries to draw (rounded up)
    sample.size <- ceiling(length(data) * percentage)
    # draw the entries without replacement
    sampled_entries <- sample(data, sample.size, replace = FALSE)
    return(sampled_entries)
}
After sampling ~5% from each data set, we collapse each sample into a single text and create a vector containing the three texts:
# sample the texts
sample_blog <- sampleTexts(en_blog_data, 0.05)
sample_twitter <- sampleTexts(en_twitter_data, 0.05)
sample_news <- sampleTexts(en_news_data, 0.05)
# paste it together
en_blog_full_text <- paste(sample_blog, collapse = " ")
en_twitter_full_text <- paste(sample_twitter, collapse = " ")
en_news_full_text <- paste(sample_news, collapse = " ")
full_data <- c(en_blog_full_text, en_twitter_full_text, en_news_full_text)
Now we create a document corpus. For this task we use the {tm} R package. It provides a function named tm_map which can apply various transformations to a given corpus. With the help of this function we remove unneeded characters from our corpus, like extra whitespace, punctuation, and also numbers.
At the time of writing this report I think numbers are not useful at this early stage of word prediction, so they get removed too. Stop words like 'the', 'or', and 'and' will not be removed because they seem to be useful for predicting the next word.
# create corpus with the {tm} package
library(tm)
doc_vec <- VectorSource(full_data)
doc_corpus <- VCorpus(doc_vec)
# cleanup data
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus <- tm_map(doc_corpus, removeNumbers)
doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus <- tm_map(doc_corpus, content_transformer(tolower))
# there are still some characters left that we do not need,
# so use gsub to filter them out too
doc_corpus <- tm_map(doc_corpus, toEmpty, "[^[:alpha:][:space:]]")
Wikipedia says:
…an n-gram is a contiguous sequence of n items from a given sequence of text or speech
N-grams can be used to build an n-gram model for predicting the next word and therefore fit our need to create an app that predicts next words. So we will create bi-grams and tri-grams. If we see later that four-grams are also useful, we will construct them, but for the moment we will focus on bi- and tri-grams. A small illustration is given below.
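To illustrate the idea (this is only a toy example in base R, not the tokenizer used later in this report), the bi-grams and tri-grams of a short example sentence can be built like this:
# toy example: bi-grams and tri-grams of a short sentence, built with base R only
words <- c("we", "predict", "the", "next", "word")
# bi-grams: all pairs of adjacent words
bi_grams <- paste(head(words, -1), tail(words, -1))
# tri-grams: all triples of adjacent words
tri_grams <- paste(head(words, -2), words[2:(length(words) - 1)], tail(words, -2))
bi_grams
## [1] "we predict"  "predict the" "the next"    "next word"
tri_grams
## [1] "we predict the"   "predict the next" "the next word"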
Let’s take a deeper look into our data and see what we have.
At this point we show a summary of the original and sampled data sets. The line count of each data set (original and sampled) is shown as well as the word count of our samples:
| | blog data | twitter data | news data |
|---|---|---|---|
| line count, original data | 899288 | 2360148 | 1010242 |
| line count, sampled data set | 44965 | 118008 | 50513 |
| word count, sampled data set | 71624 | 69464 | 71266 |
It's easy to see that the line count of the sampled data is ~5% of its original.
A profanity filter is not provided at this point, but it is planned for the final Shiny app.
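For completeness, a minimal sketch of how such a filter could be plugged into the {tm} cleaning steps shown above is given here; the file bad_words.txt is a hypothetical word list (one word per line) and is not part of this report:
# hypothetical list of profane words, one per line (not shipped with this report)
bad_words <- readLines("bad_words.txt", warn = FALSE)
# removeWords() is a standard {tm} transformation and could simply be
# appended to the cleaning steps applied to the corpus above
doc_corpus <- tm_map(doc_corpus, removeWords, bad_words)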
After we have cleaned up our data and created the corpus, we now construct a term-document matrix. This matrix has the documents of the given corpus in its columns and the terms of the whole corpus in its rows. For each term and document there is a count of how often the term appears in that document. The generated matrix is a sparse matrix; that means, for example, that some terms occur only once in a single document, so we have many zero values in the other documents for these terms. Our term-document matrix has the following form:
tdm <- TermDocumentMatrix(doc_corpus)
tdm
## <<TermDocumentMatrix (terms: 142415, documents: 3)>>
## Non-/sparse entries: 210346/216899
## Sparsity : 51%
## Maximal term length: 252
## Weighting : term frequency (tf)
As you can see, we have a sparsity of 51% and the longest word consists of 252 characters, which, by the way, is too long.
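Such overly long 'terms' are most likely artefacts (for example words glued together when punctuation was removed). They are kept for now, but a simple way to drop them later could look like this (only a sketch; the 20-character threshold is an arbitrary choice of mine):
# keep only terms of a reasonable length (threshold chosen arbitrarily)
max_term_length <- 20
tdm_short <- tdm[which(nchar(Terms(tdm)) <= max_term_length), ]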
Let's inspect a few rows of the term-document matrix to see what they look like.
inspect(tdm[1997:2007, 1:3])
## <<TermDocumentMatrix (terms: 11, documents: 3)>>
## Non-/sparse entries: 16/17
## Sparsity : 52%
## Maximal term length: 17
## Weighting : term frequency (tf)
##
## Docs
## Terms 1 2 3
## agrans 0 0 1
## agrarian 0 0 1
## agree 130 321 115
## agreeable 0 1 1
## agreeably 0 1 0
## agreeance 1 0 0
## agreed 91 128 176
## agreedand 0 1 0
## agreedplusice 0 1 0
## agreedupon 0 0 2
## agreeexperiencing 0 1 0
The word 'agree' appears moderately often in all three documents (blogs, twitter, and news). For further processing we will remove some sparse entries to save space and create a matrix we will use in the next step.
tdm_common = removeSparseTerms(tdm, 0.5)
tdm_dense <- as.matrix(tdm_common)
In this section we want to visualize the word frequencies of single words as well as bi- and tri-grams to get an idea of which words occur most often. Let's start with a simple word cloud of single words.
# the word cloud needs the {wordcloud} and {RColorBrewer} packages
library(wordcloud)
library(RColorBrewer)
# sort our data set by overall term frequency
tdm_dense_sorted <- sort(rowSums(tdm_dense), decreasing = TRUE)
df_tdm_sorted <- data.frame(word = names(tdm_dense_sorted), freq = tdm_dense_sorted)
# create word cloud
wordcloud(df_tdm_sorted$word, df_tdm_sorted$freq, scale = c(4, 0.8), min.freq = 5,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
As one can see, words like 'the', 'and', 'that', and 'you' appear very often. Some of the most frequent words are so-called stop words. We did not remove them because they could play an important role in next-word prediction, as already mentioned above.
Now we prepare our corpus and create a bar plot of the top 20 words for each document type. We apply a log scale to the y-axis to get a better feeling for the values.
# melt the data together and measure the count of each single word
# in each document (needs the {reshape2} package)
library(reshape2)
tdm_dense_matrix <- melt(tdm_dense, value.name = "count")
tdm_dense_matrix <- tdm_dense_matrix[with(tdm_dense_matrix, order(Docs, count, decreasing = TRUE)), ]
# extract the top 20 for each document (with {dplyr}) and relabel the documents
library(dplyr)
top_twenty <- tdm_dense_matrix %>%
    group_by(Docs) %>%
    arrange(desc(count)) %>%
    slice(1:20)
# the document ids 1..3 correspond to blog, twitter and news
top_twenty$Docs <- as.character(top_twenty$Docs)
top_twenty$Docs[top_twenty$Docs == 1] <- 'blog'
top_twenty$Docs[top_twenty$Docs == 2] <- 'twitter'
top_twenty$Docs[top_twenty$Docs == 3] <- 'news'
top_twenty$Docs <- as.factor(top_twenty$Docs)
# do a bar plot of the top 20 words grouped by document type (with {ggplot2})
library(ggplot2)
ggplot(top_twenty, aes(x=Terms, y=count, fill=Docs)) +
geom_bar(stat="identity") +
scale_y_continuous( trans = "log10")+
xlab("Top 20 terms") +
ylab("Count of terms") +
ggtitle(paste("Count of Top-20 terms sampled \n from blog-, news- and twitter-data")) +
theme(axis.text.x=element_text(angle=45, hjust=1))
Next we will visualize the frequency of n-grams in a similar manner to the single words. For this report only bi-grams and tri-grams will be covered because they are the most commonly used.
To extract the n-grams another approach will be used: a tokenizer which can be found here. The reason for this is that the tokenizer from the 'RWeka' R package, which we intended to use at first, is too slow. A simplified sketch of the idea is shown below.
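The linked gist is not reproduced here; the following is my own simplification of how such a tokenizer works (it splits a text into words and pastes every n adjacent words together):
# minimal n-gram tokenizer sketch (simplified, not the exact code from the gist)
ngram_tokenizer <- function(text, n = 2)
{
    words <- unlist(strsplit(text, "\\s+"))
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngram_tokenizer("thanks for the follow", n = 2)
## [1] "thanks for" "for the"    "the follow"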
Now let's start with a bar plot of the Top-20 bi-grams found in our blog, news, and twitter data sets (concretely, this means the 5% sample from each of our data sets).
df.bi.grams <- readRDS("bi_grams_full.rds")
top_twenty_bigrams <- df.bi.grams %>%
group_by(category) %>%
arrange(desc(count)) %>%
slice(1:20)
top_twenty_bigrams$category[top_twenty_bigrams$category == 1] <- 'blog'
top_twenty_bigrams$category[top_twenty_bigrams$category == 2] <- 'twitter'
top_twenty_bigrams$category[top_twenty_bigrams$category == 3] <- 'news'
top_twenty_bigrams$category <- as.factor(top_twenty_bigrams$category)
ggplot(top_twenty_bigrams, aes(x = term, y = count, fill = category)) +
geom_bar(stat="identity") +
xlab("Top 20 Bi-gram terms") +
ylab("Count of Bi-gram terms") +
ggtitle(paste("Count of Top-20 Bi-grams sampled \n from blog-, news- and twitter-data")) +
coord_flip()
As one can see, bi-gram terms like 'of the' and 'in the' appear very often.
And here we present the same plot for the top 20 tri-grams:
Tri-gram terms like ‘one of the’ and ‘a lot of’ are some ‘winners’ here.
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
To answer this we will iterate over the frequency-sorted single words of our 5% sample corpus and stop once we reach a limit. This limit is the given coverage multiplied by our total word count. The loop index (i in this case) then gives us the number of words needed to cover the given percentage.
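A minimal sketch of this loop could look like the following; as input I use the frequency-sorted data frame df_tdm_sorted from the word-cloud section, which is an assumption on my part about the exact object used for the numbers below:
# how many of the most frequent words are needed to cover `coverage`
# (e.g. 0.5 or 0.9) of all word instances in our sample?
word_coverage <- function(freq_sorted, coverage)
{
    total <- sum(freq_sorted)
    running <- 0
    for (i in seq_along(freq_sorted)) {
        running <- running + freq_sorted[i]
        if (running >= coverage * total) return(i)
    }
    length(freq_sorted)
}
word_coverage(df_tdm_sorted$freq, 0.5)
word_coverage(df_tdm_sorted$freq, 0.9)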
This is the word count for 50% and 90% coverage:
## Word coverage for 50 percent: 140
## Word coverage for 90 percent: 7777
Now let's see in a plot what the coverage distribution looks like:
ggplot(data=df.coverage, aes(x=xlab, y=word_count)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = xlab) +
geom_text(aes(label=word_count), hjust = 1.2, vjust = -0.4) +
geom_vline(xintercept = 50, color="red") +
geom_hline(aes(yintercept=140), color="red") +
geom_vline(xintercept = 90, color="blue") +
geom_hline(aes(yintercept=7777), color="blue") +
xlab("Coverage in percent") +
ylab("word count") +
ggtitle("Word count to cover all word instances")
It can be seen that the number of words needed for coverage does not grow linearly; it grows roughly exponentially with increasing coverage. For every 10 percentage points between 10% and 60%, the number of required words approximately doubles, and for higher coverage levels it more than doubles.
We have to find a trade-off here. Maybe we do not need full coverage, so we could drop the least frequent words; this would save us some memory, too. Another strategy could be stemming the words.
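For completeness, stemming could be applied with the standard {tm} transformation stemDocument (this is only a sketch of the idea, not something used in this report; it requires the {SnowballC} package to be installed):
library(SnowballC)   # provides the stemmer used by stemDocument()
# stemming maps inflected forms (e.g. 'agree' and 'agreed') onto a common stem
doc_corpus_stemmed <- tm_map(doc_corpus, stemDocument)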
The next steps for this project could look as follows:
- build the actual next-word prediction model from the bi- and tri-grams (and four-grams, if they turn out to be useful),
- add the planned profanity filter,
- keep an eye on memory usage and prediction speed, possibly by reducing the dictionary as discussed above,
- wrap everything into the final Shiny app.
Natural language processing is a completely new topic for me, so my approach to analyzing things may have some inconsistencies. Another thing to mention is that English is not my mother tongue, so my writing style in this report isn't perfect and may sometimes be a bit harder to understand. In summary, I learned a lot in the past days, weeks, and months, and I know it takes some time and a lot of practice to sharpen my analytical skills. I have a lot of fun engaging with this cool analytical stuff.
GitHub gist of a simple n-gram tokenizer, https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R
Wikipedia, article about n-grams, http://en.wikipedia.org/wiki/N-gram
The Stanford Natural Language Processing Group, http://nlp.stanford.edu/
Coursera, Natural Language Processing, Michael Collins, Columbia University, 2013, https://class.coursera.org/nlangp-001
Coursera, Natural Language Processing, Dan Jurafsky, Christopher Manning, Stanford University, https://www.coursera.org/course/nlp
Prototype ML/NLP Code: Tutorial Series, www.thoughtly.co, http://www.thoughtly.co/blog/category/mlnlp-tutorial-series/
Thomas (Data Science Addict from Germany)