Introduction

This report explores the SwiftKey input data set and outlines a path to designing a next word prediction algorithm. It presents an exploratory analysis, details the main goals of the project, explains the major features of the data, and briefly summarizes the plans for building the prediction algorithm and the accompanying Shiny app.

Basic summaries of the data

The data set consists of millions of words extracted from blogs, news articles, and Twitter streams.
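A minimal sketch of how these summaries can be computed, assuming the three files live under en_US/ and using the stringi package for word counts:

library(stringi)

files <- c("en_US/en_US.blogs.txt", "en_US/en_US.news.txt", "en_US/en_US.twitter.txt")
stats <- t(sapply(files, function(f) {
  txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(Lines = length(txt),
    Words = sum(stri_count_words(txt)),
    Characters = sum(nchar(txt)))
}))
stats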

##     Lines     Words Characters                    File
## 1  899288  37334114  210160014   en_US/en_US.blogs.txt
## 2 1010242  34365936  205811889    en_US/en_US.news.txt
## 3 2360148  30359804  167105338 en_US/en_US.twitter.txt
## 4 4269678 102059854  583077241                   total

To count the number of distinct terms, we can convert the data into a corpus and create a document term matrix. From here on, only a sample of the data (the first 1000 lines of each file) is used, because of the high memory consumption and the time required to process the full data set.
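A minimal sketch of the corpus and document term matrix construction with the tm package; the read_sample helper and the preprocessing steps below are assumptions, not necessarily the exact pipeline behind the numbers shown:

library(tm)

# Read the first 1000 lines of a file (sampling keeps memory use manageable)
read_sample <- function(path, n = 1000) {
  readLines(path, n = n, encoding = "UTF-8", skipNul = TRUE)
}

docs <- c(Blogs   = paste(read_sample("en_US/en_US.blogs.txt"),   collapse = " "),
          News    = paste(read_sample("en_US/en_US.news.txt"),    collapse = " "),
          Twitter = paste(read_sample("en_US/en_US.twitter.txt"), collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

dtm <- DocumentTermMatrix(corpus)
dtm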

## <<DocumentTermMatrix (documents: 3, terms: 14955)>>
## Non-/sparse entries: 19990/24875
## Sparsity           : 55%
## Maximal term length: 96
## Weighting          : term frequency (tf)

##   Blogs    News Twitter 
##   21267   18887    6955

To find out whether words are evenly distributed across sentences, we can compare standardized word counts per sentence for the three sources.
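A rough sketch of one way to compute such counts, reusing the read_sample helper above; the sentence splitting is a simple punctuation heuristic:

# Split on end-of-sentence punctuation, count words, then standardize
words_per_sentence <- function(text) {
  sentences <- trimws(unlist(strsplit(paste(text, collapse = " "), "[.!?]+")))
  sentences <- sentences[nchar(sentences) > 0]
  sapply(strsplit(sentences, "\\s+"), length)
}

wps <- words_per_sentence(read_sample("en_US/en_US.blogs.txt"))
summary(as.vector(scale(wps)))   # standardized: (x - mean) / sd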

There appears to be no significant difference in sentence length among the three data sets, so they will all be processed in the same way.

Explore features of the data

We can plot the document term matrix to get an idea of the correlations between terms (showing only pairs with correlation >= 0.6).
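A sketch of how such a plot can be produced, assuming the dtm object built above; tm's plot method for term matrices requires the Rgraphviz package, and the frequency threshold is an assumption:

library(Rgraphviz)  # Bioconductor package used by tm's plot method

# Plot frequent terms, drawing an edge where the term correlation is >= 0.6
freq_terms <- findFreqTerms(dtm, lowfreq = 100)
plot(dtm, terms = freq_terms, corThreshold = 0.6)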

Here are the most frequent words in the data set:
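A minimal sketch of how these frequencies can be computed, summing the columns of the dtm object built earlier:

freq <- sort(colSums(as.matrix(dtm)))
tail(freq, 15)   # the 15 most frequent terms, in increasing order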

## people    day   good  first   know   dont    get   time    new    can 
##    120    122    123    126    139    143    166    168    186    188 
##   like    one   just   will   said 
##    237    239    250    257    260

To build a language model, we will need to explore the relationships between words more thoroughly. To do so, we can create n-grams and compare how well each order captures the word sequences in the data, as sketched below.
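One possible sketch, using RWeka's NGramTokenizer to build a bigram document term matrix from the corpus above; the tokenizer choice and the frequency threshold are assumptions:

library(RWeka)

# Tokenizer that emits two-word sequences (bigrams)
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
findFreqTerms(dtm_bigram, lowfreq = 20)   # bigrams seen at least 20 times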

Next word prediction algorithm

Given the nature of the sentences, I am basing my model on Restricted Boltzmann Machines (a kind of neural network).

The language model is inspired by the following papers:

There are a few libraries available for working with RBMs and similar neural networks. I am experimenting with deepnet and darch.
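A toy sketch of training an RBM with deepnet; the random binary matrix below stands in for a real encoding of the text, and all sizes and hyperparameters are placeholder assumptions:

library(deepnet)

set.seed(1)
x <- matrix(rbinom(100 * 20, 1, 0.5), nrow = 100)   # 100 samples, 20 binary features

# Train a small RBM with 8 hidden units via contrastive divergence
r <- rbm.train(x, hidden = 8, numepochs = 5, batchsize = 10)

h <- rbm.up(r, x)    # propagate visible units up to the hidden layer
v <- rbm.down(r, h)  # reconstruct visible units from hidden activations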