Coursera has partnered with SwiftKey for the Capstone Project in the Data Science Track, where we will be creating a predictive text application that attempts to guess the next word in a sentence as you type it. To build our predictive language models, we have been supplied with a data set containing a large volume of text collected from Twitter, blogs and news sources. A basic approach to predicting the next word in a sentence is to look at the last few words preceding the word we are trying to predict, then search the data set for other occurrences of that word combination to see which words typically follow it. With this in mind, this initial exploration focuses on breaking the collection of text down into word combinations, so-called n-grams.
First, a brief summary of the files we have:
| File | Approx. size | Line count | Word count |
|---|---|---|---|
| en_us.blogs.txt | 205 MB | 899,288 | 37,334,131 |
| en_us.news.txt | 201 MB | 1,010,242 | 34,372,530 |
| en_us.twitter.txt | 163 MB | 2,360,148 | 30,373,583 |
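For reference, the counts above could be reproduced with something along these lines; the file path and the whitespace-based word split are assumptions on my part, not necessarily how the table was originally generated.

# Approximate line and word counts for a single file
count_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  words <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))
  c(lines = length(lines), words = words)
}
count_file("./final/en_US/en_us.twitter.txt")   # hypothetical path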
Due to the size of the data set, I use 10% of each file for exploratory analysis. To avoid the risk of a biased sample, I chose lines at random using the binomial distribution: essentially flipping a coin for each line and selecting it for my sample if it came up heads, except the coin was biased so that heads came up only 10% of the time. This means each sample has roughly 10% of the size, line count and word count of its original counterpart, selected at random.
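In R terms, the coin flip is a draw from rbinom() for each line. The sketch below shows the idea for one file; the input and output paths are assumptions chosen to match the samples directory used later.

set.seed(42)                                                    # make the sample reproducible
lines <- readLines("./final/en_US/en_us.blogs.txt", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.10) == 1      # biased coin: ~10% heads
writeLines(lines[keep], "./final/samples/en_us.blogs.sample.txt")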
To make the data easier to analyze, I apply a few basic transformations, such as converting everything to lower case and removing extra whitespace and punctuation. Once this is done, I load it into a document-term matrix, which is a large grid holding the count of each word that occurs in each document.
library(tm)

# Load the sampled text files into a corpus
myCorpus <- Corpus(DirSource("./final/samples/"))
# Basic transformations: lower case, collapse extra whitespace, drop punctuation
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, removePunctuation)
# Build the document-term matrix of word counts per document
dtm <- DocumentTermMatrix(myCorpus)
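As a quick sanity check on the matrix, tm's findFreqTerms() lists the words that occur most often; the frequency threshold below is arbitrary.

# Words appearing at least 1,000 times in the sampled corpus
findFreqTerms(dtm, lowfreq = 1000)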
To prepare for the analysis, I tokenized the entire corpus, that is, split it up into separate words, and then created collections of n-grams. An n-gram is a sequence of n words, so the set of 2-grams contains every two-word combination that appears in the sample. Generally, the longer the n-gram, the more specific the meaning it conveys, but the lower its frequency. The goal is to eventually use these n-grams in a predictive model.
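As an illustration, the 3-gram counts could be produced along the following lines. This is a sketch only: it assumes the RWeka package is available, and the object names (TrigramTokenizer, tdm3, freq3) are mine rather than taken from the original analysis.

library(RWeka)

# Tokenizer that splits text into 3-grams (three consecutive words)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Term-document matrix where each "term" is a 3-gram
tdm3 <- TermDocumentMatrix(myCorpus, control = list(tokenize = TrigramTokenizer))
# Total frequency of each 3-gram across the corpus, from most to least common
freq3 <- sort(slam::row_sums(tdm3), decreasing = TRUE)
head(freq3, 15)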
Here’s an overview of the counts of unique n-grams at each level. As can be intuitively expected, there are more unique combinations of words than there are single words.
library(ggplot2)

# countsdf holds one row per n-gram order with the number of unique n-grams found
ggplot(countsdf, aes(ngram, count)) + geom_bar(stat = "identity")
In my opinion, 3-grams offer the best trade-off between meaning and frequency, so let’s have a look at the 15 most common 3-grams and their distribution:
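A data frame like the fr used below could be built from the 3-gram frequencies; the freq3 vector comes from the earlier sketch and is an assumption, not the original code.

# Top 15 most frequent 3-grams, shaped for plotting
fr <- data.frame(word = names(freq3)[1:15], freq = as.numeric(freq3[1:15]))
fr$word <- factor(fr$word, levels = fr$word)   # keep the bars in descending frequency order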
# Angled axis labels keep the longer 3-grams readable
ggplot(fr, aes(word, freq)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
This brief analysis is meant to give a quick, reasonably non-technical overview of the corpus we have access to.