Coursera has partnered with SwiftKey for the Capstone Project in the Data Science Track, where we will be creating a predictive text application that attempts to guess the next word in a sentence as you type it. To build our predictive language models, we have been supplied with a data set containing a large volume of text collected from Twitter, blogs and news sources. A basic approach to predicting the next word in a sentence is to look at the last few words preceding the word we are trying to predict, then search the data set for other occurrences of that word combination to see which words typically follow it. With this in mind, this initial exploration focuses on breaking the collection of text down into word combinations, so-called n-grams.
First, a brief summary of the files we have:
| File | Approx. size | Line count | Word count |
|---|---|---|---|
| en_us.blogs.txt | 205 MB | 899,288 | 37,334,131 |
| en_us.news.txt | 201 MB | 1,010,242 | 34,372,530 |
| en_us.twitter.txt | 163 MB | 2,360,148 | 30,373,583 |
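For reference, the counts above could be reproduced with something along these lines; the file path and the whitespace-based word split are assumptions on my part, not necessarily how the table was originally generated.

# Approximate line and word counts for a single file
count_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
  c(lines = length(lines), words = words)
}
count_file("./final/en_US/en_us.twitter.txt")   # hypothetical path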
Due to the size of the data set, I use 10% of each file for exploratory analysis. To avoid the risk of a biased sample, I chose lines at random using the binomial distribution: essentially flipping a coin for each line and selecting it for my sample if it came up heads, except the coin was biased so that heads came up only 10% of the time. This means each sample has roughly 10% of the size, line count and word count of its original counterpart, selected at random.
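In R terms, the coin flip is a draw from rbinom() for each line. The sketch below shows the idea for one file; the input and output paths are assumptions chosen to match the samples directory used later.

set.seed(42)                                                    # make the sample reproducible
lines <- readLines("./final/en_US/en_us.blogs.txt", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.10) == 1      # biased coin: ~10% heads
writeLines(lines[keep], "./final/samples/en_us.blogs.sample.txt")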
To make the data easier to analyze, I apply a few basic transformations, such as converting everything to lower case and removing extra whitespace and punctuation. Once this is done, I load it into a document-term matrix, which is a large grid holding the count of each word that occurs in each document.
library(tm)

# Load the sampled text files into a corpus
myCorpus <- Corpus(DirSource("./final/samples/"))
# Basic transformations: lower case, collapse extra whitespace, drop punctuation
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, removePunctuation)
# Build the document-term matrix of word counts per document
dtm <- DocumentTermMatrix(myCorpus)
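As a quick sanity check on the matrix, tm's findFreqTerms() lists the words that occur most often; the frequency threshold below is arbitrary.

# Words appearing at least 1,000 times in the sampled corpus
findFreqTerms(dtm, lowfreq = 1000)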
To prepare for the analysis, I tokenized the entire corpus, that is, split it up into separate words, and then created collections of n-grams. An n-gram is a sequence of n words, so the set of 2-grams contains every two-word combination that appears in the sample. Generally, the longer the n-gram, the more specific the meaning it conveys, but the lower its frequency. The goal is to eventually use these n-grams in a predictive model.
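As an illustration, the 3-gram counts could be produced along the following lines. This is a sketch only: it assumes the RWeka package is available, and the object names (TrigramTokenizer, tdm3, freq3) are mine rather than taken from the original analysis.

library(RWeka)

# Tokenizer that splits text into 3-grams (three consecutive words)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Term-document matrix where each "term" is a 3-gram
tdm3 <- TermDocumentMatrix(myCorpus, control = list(tokenize = TrigramTokenizer))
# Total frequency of each 3-gram across the corpus, from most to least common
freq3 <- sort(slam::row_sums(tdm3), decreasing = TRUE)
head(freq3, 15)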
Here’s an overview of the counts of unique n-grams at each level. As can be intuitively expected, there are more unique combinations of words than there are single words.
library(ggplot2)

# countsdf holds one row per n-gram order with the number of unique n-grams found
ggplot(countsdf, aes(ngram, count)) + geom_bar(stat = "identity")
In my opinion, 3-grams offer the best trade-off between meaning and frequency, so let’s have a look at the 15 most common 3-grams and their distribution:
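A data frame like the fr used below could be built from the 3-gram frequencies; the freq3 vector comes from the earlier sketch and is an assumption, not the original code.

# Top 15 most frequent 3-grams, shaped for plotting
fr <- data.frame(word = names(freq3)[1:15], freq = as.numeric(freq3[1:15]))
fr$word <- factor(fr$word, levels = fr$word)   # keep the bars in descending frequency order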
# Angled axis labels keep the longer 3-grams readable
ggplot(fr, aes(word, freq)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
This brief analysis is meant to give a quick, reasonably non-technical overview of the corpus we have access to.