Milestone Report: Analyzing Text to Generate Insights into the Frequency of Word Combinations

Introduction

For this Data Science Milestone Report, I analyzed a corpus of English text that includes Twitter posts, news articles, and blog posts. I sought to understand how often words occur on their own, in pairs, and in triples. The goal is to develop a data structure and algorithm for applications that predict what a user might want to type next after they have typed one or two words.

The data is provided by SwiftKey, the creators of predictive mobile keyboards. The corpus covers a range of writing styles, from Twitter posts, which are usually very informal and limited to 140 characters, to longer and more formal blog posts and news articles.

Data Download

We download the data and uncompress it for use.

# download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip', 'Coursera-SwiftKey.zip')
# unzip('Coursera-SwiftKey.zip')

Data Cleaning

I spent substantial time trying several packages recommended in the class’s forums, including tm, RWeka, text2vec and quanteda. Ultimately, I decided to use quanteda because it is more intuitive, ran faster and made better use of memory than the other packages. I was able to create n-grams through its dfm function using the whole dataset, whereas the other packages hit memory limits unless I sampled the data down to a fraction of the original.

The dfm (document-feature matrix) function in quanteda allowed me to work at a higher level than the other packages by abstracting the creation of n-grams, which are sequences of n consecutive words (for example, pairs or triplets) in a corpus of text. By default, the dfm function does some pre-processing of the text, such as lower-casing and removing punctuation.
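
To make the idea of an n-gram concrete, here is a minimal base-R sketch of what n-gram construction involves. It is only an illustration under simplifying assumptions: make_ngrams is a hypothetical helper, not a quanteda function, and its crude cleaning step (lower-casing, dropping everything except letters and apostrophes) only approximates what dfm does internally.

```r
# Illustrative only: make_ngrams is a hypothetical helper, not part of quanteda.
make_ngrams <- function(text, n) {
  # Lower-case, then replace anything that is not a letter,
  # an apostrophe or a space with a space.
  cleaned <- gsub("[^a-z' ]", " ", tolower(text))
  tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Fewer than n tokens means there is no n-gram to build.
  if (length(tokens) < n) return(character(0))
  # Slide a window of width n over the tokens.
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams_example <- make_ngrams("The quick brown fox.", 2)
bigrams_example
# "the quick" "quick brown" "brown fox"
```

A four-word sentence yields three bigrams and two trigrams, which is why the number of distinct n-grams grows so quickly with n.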

One strategy that made a huge difference in memory utilization was to give the dfm function a vector of lines read from the files with readLines, instead of feeding it the whole text as a single corpus. This strategy kept me from hitting memory limits.

Data Analysis

First I extracted the paragraphs (lines) from the three sources and combined them into a single vector of lines. As mentioned above, dfm makes better use of memory when provided with a vector of paragraphs than with the whole corpus at once.

library(quanteda)

lines_twitter <- readLines('final/en_US/en_US.twitter.txt')
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 167155 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 268547 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt"): line 1759032 appears
## to contain an embedded nul
lines_blogs <- readLines('final/en_US/en_US.blogs.txt')
lines_news <- readLines('final/en_US/en_US.news.txt')

lines <- c(lines_twitter, lines_news, lines_blogs)

I used the wc command-line program, available on most Unix-like systems, because it provides a quick way to calculate line and word counts for the corpus.

wordCount <- system('wc -lw final/en_US/*.txt', intern=TRUE)
wordCount
## [1] "  899288 37334690 final/en_US/en_US.blogs.txt"  
## [2] " 1010242 34372720 final/en_US/en_US.news.txt"   
## [3] " 2360148 30374206 final/en_US/en_US.twitter.txt"
## [4] " 4269678 102081616 total"

We can see that the entire corpus contains more than 100 million words and more than 4 million lines.
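
On systems where wc is not available, similar counts could be approximated in R itself. The sketch below assumes that a "word" is any run of non-whitespace characters, which is close to, but not guaranteed to match, what wc reports; count_lines_words is a hypothetical helper and the input is toy data rather than the corpus.

```r
# Hypothetical helper: approximate `wc -lw` for a character vector of lines.
count_lines_words <- function(x) {
  # Trim each line, then split on runs of whitespace.
  # An empty line yields zero tokens.
  tokens <- strsplit(trimws(x), "\\s+")
  c(lines = length(x), words = sum(lengths(tokens)))
}

toy_lines <- c("hello world", "", "  one two  three ")
counts <- count_lines_words(toy_lines)
counts
```

Applied to the lines vector built above, this would give one combined total rather than the per-file breakdown that wc prints.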

Unigrams

First, I want to get a sense of the most common words in the corpus. In this section and the next two, I plot a bar chart of the 20 most common n-grams.

I start by extracting the unigrams (single words) from the corpus, then select the 20 most common words (features), and finally create a data frame and plot it with ggplot as a bar chart.

library(ggplot2)

unigrams <- quanteda::dfm(lines)
topFeatures <- topfeatures(unigrams, 20)
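
As a rough picture of what this step computes, the following base-R sketch counts tokens and keeps the most frequent ones. It is an approximation for illustration only: top_terms is a hypothetical helper, its tokenization is much cruder than quanteda's, and the input is toy text.

```r
# Hypothetical helper: count tokens and keep the n most frequent.
top_terms <- function(texts, n = 20) {
  # Split on anything that is not a letter or an apostrophe.
  tokens <- unlist(strsplit(tolower(texts), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  counts <- sort(table(tokens), decreasing = TRUE)
  head(counts, n)
}

top2 <- top_terms(c("the cat sat", "the dog sat", "the end"), n = 2)
top2
```

Here "the" appears three times and "sat" twice, so those two words are returned in that order.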

I use ggplot2::geom_bar to create a bar chart from the data, reorder the words by count, and use ggplot2::coord_flip to place the words on the vertical axis so that the most common words read from top to bottom.

df <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df, aes(x=reorder(word, count), y=count)) + geom_bar(stat="identity") + coord_flip() + xlab('Words (unigrams)') + ylab('Count') + ggtitle('Most Common Unigrams Including Stop Words')

As we can see, the most common words are “stop words” in the English language. We would probably want to include these n-grams in a predictive typing application, but here we explore the unigrams that remain once these “stop words” are excluded. I invoke the dfm function with the ignoredFeatures parameter, which takes the words to exclude while processing the data. quanteda conveniently provides a stopwords function with stop-word lists for several languages.

Curiously enough, dfm can also take a boolean value for its removeTwitter parameter, which handles the ‘#’ and ‘@’ characters used in Twitter posts. This parameter makes no difference to the 20 most common n-grams, but I include it for illustration because it would become important in a more general typing-prediction application.

unigrams <- quanteda::dfm(lines, ngrams=1, ignoredFeatures=c(stopwords("english")), removeTwitter=TRUE)
topFeatures <- topfeatures(unigrams, 20)

df <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df, aes(x=reorder(word, count), y=count)) + geom_bar(stat="identity") + coord_flip() + xlab('Words (unigrams)') + ylab('Count') + ggtitle('Most Common Unigrams Excluding Stop Words')

When ignoring “stop words”, we can see that

Bigrams

The graph below shows how the most common unigrams give rise to the most common bigrams. The bigrams are computed from a random sample of 2,000,000 lines.

bigrams_lines <- base::sample(lines, 2000000)
bigrams <- quanteda::dfm(bigrams_lines, ngrams=2, concatenator=" ", ignoredFeatures=c(stopwords("english")), removeTwitter=TRUE)
topFeatures <- topfeatures(bigrams, 20)
df <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df, aes(x=reorder(word, count), y=count)) + geom_bar(stat="identity") + coord_flip() + xlab('Words (bigrams)') + ylab('Count') + ggtitle('Most Common Bigrams Excluding Stop Words')

Trigrams

The chart below shows that some of the most common trigrams build on some of the most common bigrams. The trigrams are computed from a random sample of 1,000,000 lines.

trigrams_lines <- base::sample(lines, 1000000)
trigrams <- quanteda::dfm(trigrams_lines, ngrams=3, concatenator=" ", ignoredFeatures=c(stopwords("english")), removeTwitter=TRUE)
topFeatures <- topfeatures(trigrams, 20)
df <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df, aes(x=reorder(word, count), y=count)) + geom_bar(stat="identity") + coord_flip() + xlab('Words (trigrams)') + ylab('Count') + ggtitle('Most Common Trigrams Excluding Stop Words')

Coverage per Number of N-grams

In this section I investigate how many of the most common n-grams are needed to cover a given percentage of the whole corpus.

For the unigrams, we see that we only need a dozen words in order to cover about 50% of the corpus, and about 10% of the most common unique words to cover about 90% of the corpus.

# ncol() gives the number of features (columns) in the document-feature matrix
words <- topfeatures(unigrams, ncol(unigrams))
df <- data.frame(words)
df$n <- seq_len(nrow(df))
df$total <- cumsum(df$words)
ggplot(df, aes(x=n, y=total)) + geom_line() + xlab('Number of Unique Words') + ylab('Coverage') + ggtitle('Coverage per Number of Unigrams')

For the bigrams, we already need about a third of the most common bigrams to cover about 50% of the corpus, and about 80% of the most common bigrams to cover about 90% of the corpus.

# ncol() gives the number of features (columns) in the document-feature matrix
words <- topfeatures(bigrams, ncol(bigrams))
df <- data.frame(words)
df$n <- seq_len(nrow(df))
df$total <- cumsum(df$words)
ggplot(df, aes(x=n, y=total)) + geom_line() + xlab('Number of Unique Bigrams') + ylab('Coverage') + ggtitle('Coverage per Number of Bigrams')

The trigrams present a very different situation. There is an almost linear relationship between the percentage of the most common trigrams and their coverage of the corpus, except for the very most common trigrams.

# ncol() gives the number of features (columns) in the document-feature matrix
words <- topfeatures(trigrams, ncol(trigrams))
df <- data.frame(words)
df$n <- seq_len(nrow(df))
df$total <- cumsum(df$words)
ggplot(df, aes(x=n, y=total)) + geom_line() + xlab('Number of Unique Trigrams') + ylab('Coverage') + ggtitle('Coverage per Number of Trigrams')
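
The coverage percentages quoted in this section can also be computed directly instead of being read off the curves. The sketch below shows one way to do so, assuming a named vector of n-gram counts such as the one returned by topfeatures. The helper coverage_needed and the toy counts are illustrative assumptions, not results from the corpus.

```r
# Hypothetical helper: how many of the most frequent n-grams are needed
# to reach a target share of all n-gram occurrences?
coverage_needed <- function(counts, target) {
  sorted <- sort(counts, decreasing = TRUE)
  # Cumulative share of all occurrences covered by the top-k n-grams.
  cum_share <- cumsum(sorted) / sum(sorted)
  # Position of the first n-gram at which the share reaches the target.
  unname(which(cum_share >= target)[1])
}

# Toy counts that sum to 100, so the shares are easy to check by hand.
toy_counts <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)
n_half <- coverage_needed(toy_counts, 0.5)    # 1: "the" alone covers 50%
n_ninety <- coverage_needed(toy_counts, 0.9)  # 4: 50 + 20 + 15 + 10 = 95%
```

Dividing the result by the number of distinct n-grams gives the share of the vocabulary needed, which is the quantity compared across unigrams, bigrams and trigrams above.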

Modeling

I have come away with better insight into how to tackle the implementation of the final Shiny application based on the experience with this report.

It seems to me that we can probably get away with trigrams to offer a helpful way of selecting suggested words based on the words previously typed. There doesn’t seem to be a need to include 4-grams or longer, because each one may occur so rarely that 4-word or 5-word suggestions would be unhelpful.

I could also improve the application by enhancing the bigrams and trigrams with synonyms, and by using the stems of the words instead of the whole words to improve the suggestions. Stemming would probably also cut down on the memory needed to maintain whatever data structure is ultimately used.

The data structure used for suggestions might be a hash of lists of hashes. The keys of the parent hash are first words, and each value is a list of hashes, one per second word (that is, per bigram), ordered from the most to the least popular second word. Each hash in the list then maps that second word to the third words that complete a trigram.
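
One possible shape for this structure, sketched with nested named lists in base R. Everything here is an assumption for illustration: the entries are made up rather than taken from the corpus, suggest is a hypothetical helper, and named lists stand in for true hashes.

```r
# Toy model: first word -> second words (most frequent first) -> third words.
# All entries are invented examples, not results from the corpus.
model <- list(
  thank = list(
    you = c("for", "so", "very"),
    goodness = c("for")
  ),
  new = list(
    york = c("city", "times"),
    year = c("resolution")
  )
)

# Hypothetical lookup: suggest up to k next words after one or two typed words.
suggest <- function(model, w1, w2 = NULL, k = 3) {
  second <- model[[w1]]
  if (is.null(second)) return(character(0))
  # One word typed: offer the most frequent second words.
  if (is.null(w2)) return(head(names(second), k))
  third <- second[[w2]]
  if (is.null(third)) return(character(0))
  # Two words typed: offer the most frequent third words.
  head(third, k)
}

after_one <- suggest(model, "thank")            # "you" "goodness"
after_two <- suggest(model, "thank", "you", 2)  # "for" "so"
unknown <- suggest(model, "zebra")              # character(0)
```

Named-list lookup scans the names in order, so for a vocabulary of realistic size an environment created with new.env(hash = TRUE) might be a better fit for the outer level.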

Performance

Memory usage stayed within the limits of my 16GB laptop when the quanteda::dfm function was passed vectors of paragraphs (lines). For the final application, the n-grams could be computed once, saved, and then loaded, without recreating the document-feature matrix every time.
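
A minimal sketch of that save-and-load approach using base R's saveRDS and readRDS. The ngram_counts object is a made-up stand-in for the real n-gram data, and tempfile is used only so that the example runs anywhere. A real application would use a fixed path shipped with the app.

```r
# Hypothetical stand-in for the n-gram counts computed from the corpus.
ngram_counts <- c("thank you" = 120L, "new york" = 95L)

# Compute once, then write the object to disk.
path <- tempfile(fileext = ".rds")
saveRDS(ngram_counts, path)

# At application start-up, load the saved object instead of recomputing it.
loaded <- readRDS(path)
same <- identical(loaded, ngram_counts)
```

Reading the saved object back avoids repeating the dfm computation each time the application starts.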

The runtime was reasonable. I didn’t parallelize across multiple cores and the computation took no more than 30 minutes. It would be interesting to explore the benefits of parallelization, but since the n-grams might only need to be created once for the application, as described in the paragraph above, this might not be a big concern.