Data

The data used in this exploratory analysis can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. In particular, I will focus on the US English version of the data, which consists of three files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).

General description of data

The first step in this exploratory analysis is to describe the different data sources, in particular how many lines and words each file contains.
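The summary below assumes the three files have been read into character vectors named blogs, news and twitter. A minimal sketch of that step, assuming the files sit in the working directory:

library(stringi)

# Read each file as a character vector, one element per line;
# skipNul guards against embedded NUL characters in the raw files
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

With the vectors in place, the line and word counts can be tabulated: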

# Line and word counts per source (word counts via stringi)
filesummary<-data.frame(Source=c("Blogs","News","Twitter"), 
                        Lines=c(length(blogs),length(news),length(twitter)),
                        Words=c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter))))

filesummary
##    Source   Lines    Words
## 1   Blogs  899288 38154238
## 2    News   77259  2693898
## 3 Twitter 2360148 30218125

Sampling and Filtering

The second step in this exploratory analysis is to clean up and sample the data. Since the memory required to process the full data set is significant, I will work with a random sample of 3000 lines from each source.

library(tm)

set.seed(5)
size<-3000 
# Sample `size` lines from a source and return a cleaned tm corpus:
# stop words, numbers and punctuation removed, whitespace collapsed
CleanUpFunc<-function(x) { 
  x<-sample(x,size)
  x<-VCorpus(VectorSource(x))
  x<-tm_map(x,removeWords, stopwords("english"))
  x<-tm_map(x,removeNumbers)
  x<-tm_map(x,stripWhitespace)
  x<-tm_map(x,removePunctuation,preserve_intra_word_dashes = TRUE)
  x
}
blogs_clean<-CleanUpFunc(blogs)
news_clean<-CleanUpFunc(news)
twitter_clean<-CleanUpFunc(twitter)

Statistics on Data

In this step I will try to get a better feeling of the data by looking at two main items:

  1. 2-gram and 3-gram statistics
  2. Word coverage

The reason I focus on these is two-fold: they may give an idea of the topics behind the sources and, more importantly, they will be essential for devising a predictive algorithm.

REMARK: I will use the RWeka library.
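A minimal sketch of how the 2-gram and 3-gram counts might be computed from a cleaned corpus; NGramFreq is a hypothetical helper, not part of RWeka:

library(RWeka)

# Tokenizers that split text into 2-grams and 3-grams
BigramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

# Frequency table of the n-grams in a corpus, most frequent first
NGramFreq<-function(corpus, tokenizer) {
  tdm<-TermDocumentMatrix(corpus, control=list(tokenize=tokenizer))
  freq<-sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  data.frame(ngram=names(freq), count=freq, row.names=NULL, stringsAsFactors=FALSE)
}

blogs_2gram<-NGramFreq(blogs_clean, BigramTokenizer)
blogs_3gram<-NGramFreq(blogs_clean, TrigramTokenizer)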

To calculate the coverage I will try to find the number of distinct words that are necessary to cover 50% of the text.
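A sketch of this calculation on one source, assuming word frequencies sorted in decreasing order; Coverage is a hypothetical helper:

# Distinct words needed to cover a given fraction of all word occurrences
Coverage<-function(freq, fraction) {
  covered<-cumsum(freq)/sum(freq)
  which(covered>=fraction)[1]
}

word_freq<-sort(rowSums(as.matrix(TermDocumentMatrix(blogs_clean))), decreasing=TRUE)
Coverage(word_freq, 0.5)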

If we want to cover 90% of the text, on the other hand, we would need the following number of words.
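Using the same hypothetical helper, this is just a change of threshold:

Coverage(word_freq, 0.9)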

Creating a predictive algorithm: basic idea

Having a way to compute n-gram statistics on the data would definitely help to devise an algorithm that predicts the next word in a sentence.
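As a very rough illustration (not the final algorithm), the next word could be looked up as the most frequent 3-gram continuation of the last two words typed; PredictNext is a hypothetical helper built on the 3-gram table sketched above:

# Hypothetical sketch: most frequent 3-gram continuation of a two-word prefix
PredictNext<-function(trigram_freq, w1, w2) {
  prefix<-paste(w1, w2)
  hits<-trigram_freq[startsWith(trigram_freq$ngram, paste0(prefix, " ")), ]
  if (nrow(hits)==0) return(NA_character_)
  # rows are already sorted by count; each n-gram string is "w1 w2 w3"
  strsplit(hits$ngram[1], " ")[[1]][3]
}

PredictNext(blogs_3gram, "happy", "new")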

My plan is the following: