This report explores the SwiftKey input data set and outlines a path toward a next-word prediction algorithm. It presents an exploratory analysis, details the main goals of the project, explains the major features of the data, and briefly summarizes the plans for the prediction algorithm and Shiny app.
The data set is composed of millions of words, extracted from blogs, news articles, and Twitter streams.
## Lines Words Characters File
## 1 899288 37334114 210160014 en_US/en_US.blogs.txt
## 2 1010242 34365936 205811889 en_US/en_US.news.txt
## 3 2360148 30359804 167105338 en_US/en_US.twitter.txt
## 4 4269678 102059854 583077241 total
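The counts above were produced programmatically; a minimal sketch of how they could be reproduced, assuming the stringi package and a hypothetical `file_stats` helper, is shown below (the exact counting method used for the table may differ, e.g. `wc`):

```r
library(stringi)

# Hypothetical helper: count lines, words, and characters for one file
file_stats <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  data.frame(Lines      = length(lines),
             Words      = sum(stri_count_words(lines)),
             Characters = sum(nchar(lines)),
             File       = path)
}

files <- c("en_US/en_US.blogs.txt",
           "en_US/en_US.news.txt",
           "en_US/en_US.twitter.txt")
do.call(rbind, lapply(files, file_stats))
```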
To count the number of distinct terms, we can convert the data into corpora and create a document term matrix. Hereafter only a sample of the data will be used (the first 1000 lines of each file), due to the high memory consumption and time required to process the full data set.
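A sketch of how the sampled corpus and document term matrix could be built with the tm package follows; the exact preprocessing steps used to produce the output below may differ:

```r
library(tm)

# Read the first 1000 lines of each file
sample_blogs   <- readLines("en_US/en_US.blogs.txt",   n = 1000, skipNul = TRUE)
sample_news    <- readLines("en_US/en_US.news.txt",    n = 1000, skipNul = TRUE)
sample_twitter <- readLines("en_US/en_US.twitter.txt", n = 1000, skipNul = TRUE)

# One document per source
docs <- c(Blogs   = paste(sample_blogs,   collapse = " "),
          News    = paste(sample_news,    collapse = " "),
          Twitter = paste(sample_twitter, collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```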
## <<DocumentTermMatrix (documents: 3, terms: 14955)>>
## Non-/sparse entries: 19990/24875
## Sparsity : 55%
## Maximal term length: 96
## Weighting : term frequency (tf)
## Blogs News Twitter
## 21267 18887 6955
To find out whether the words are evenly distributed across the sentences, we can compare standardized word counts per sentence for the three sources:
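A sketch of such a comparison, assuming the samples read above and the stringi package (the boxplot is only illustrative):

```r
library(stringi)

# Standardize (z-score) the per-line word counts of each sample
word_counts <- list(Blogs   = stri_count_words(sample_blogs),
                    News    = stri_count_words(sample_news),
                    Twitter = stri_count_words(sample_twitter))
standardized <- lapply(word_counts, function(w) as.numeric(scale(w)))

boxplot(standardized, main = "Standardized word counts per line")
```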
There appears to be no significant difference in sentence length among the three data sets, so they will all be processed the same way.
We can plot the document term matrix to get an idea of the correlations between terms (correlation >= 0.6).
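A sketch of how such a plot could be produced with tm (this requires the Rgraphviz package; the frequency cutoff of 100 is an arbitrary illustrative choice):

```r
# Plot correlations above 0.6 among reasonably frequent terms
freq_terms <- findFreqTerms(dtm, lowfreq = 100)
plot(dtm, terms = freq_terms, corThreshold = 0.6)
```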
Here are the most frequent words in the data set:
## people day good first know dont get time new can
## 120 122 123 126 139 143 166 168 186 188
## like one just will said
## 237 239 250 257 260
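These counts could, for instance, be obtained by summing term frequencies across the three documents and sorting; a minimal sketch:

```r
# Total frequency of each term across the three documents
freq <- sort(colSums(as.matrix(dtm)))
tail(freq, 15)   # the 15 most frequent terms
```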
To build a language model, we will need to explore the relationships between words more thoroughly. To do so, we can create n-grams and examine which order best captures our word sequences, as sketched below.
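A sketch of bigram and trigram tokenization, assuming the RWeka package (other tokenizers would work equally well):

```r
library(tm)
library(RWeka)

# n-gram tokenizers built on RWeka's NGramTokenizer
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm_2 <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtm_3 <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))

findFreqTerms(dtm_2, lowfreq = 20)   # common bigrams
findFreqTerms(dtm_3, lowfreq = 5)    # common trigrams
```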
Due to the nature of the sentences, I am building my model based on Restricted Boltzmann Machines (a type of neural network).
The language model is inspired by the following papers:
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb):1137-1155, 2003. (http://dl.acm.org/citation.cfm?id=944966)
George E. Dahl, Ryan Prescott Adams, Hugo Larochelle. Training Restricted Boltzmann Machines on Word Observations. CoRR abs/1202.5695, 2012. (http://arxiv.org/abs/1202.5695)
Andriy Mnih, Geoffrey Hinton. Three New Graphical Models for Statistical Language Modelling. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), 641-648. ACM, New York, NY, USA, 2007. (http://dl.acm.org/citation.cfm?id=1273577)
There are a few R libraries available for working with RBMs and similar neural networks. I am experimenting with the deepnet and darch packages.
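As a toy illustration of the deepnet API only: a real model would follow Dahl et al. and train on n-gram word observations rather than document-level bags of words, and the hyperparameters below are illustrative, not tuned.

```r
library(deepnet)

# Binary word-presence features derived from the sample DTM
x <- as.matrix(dtm)
x <- (x > 0) * 1

# Train a small RBM with contrastive divergence (CD-1)
rbm <- rbm.train(x, hidden = 30, numepochs = 10, batchsize = nrow(x),
                 learningrate = 0.1, cd = 1)

# Hidden-unit activations for each document
h <- rbm.up(rbm, x)
```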