The data used in this exploratory analysis can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. In particular, I will focus on the US version of the data, which consists of three files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
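For reference, a minimal sketch of loading the three files, assuming the zip has been downloaded and extracted into the working directory (the final/en_US/ paths are an assumption about the archive layout):

# Assumed paths; adjust if the archive was extracted elsewhere
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)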
The first step in this exploratory analysis is to describe the different data sources, in particular how many lines and words each file contains. This is summarized in the following table:
library(stringi)   # provides stri_count_words()

filesummary <- data.frame(Source = c("Blogs", "News", "Twitter"),
                          Lines  = c(length(blogs), length(news), length(twitter)),
                          Words  = c(sum(stri_count_words(blogs)),
                                     sum(stri_count_words(news)),
                                     sum(stri_count_words(twitter))))
filesummary
##    Source   Lines    Words
## 1   Blogs   899288 38154238
## 2    News    77259  2693898
## 3 Twitter  2360148 30218125
The second step in this exploratory analysis is to clean up and sample the data. The memory required to process the whole data set is significant, so I will work with a random sample of 3000 lines from each source.
library(tm)   # provides VCorpus, tm_map, and the text transformations

set.seed(5)
size <- 3000

CleanUpFunc <- function(x) {
  x <- sample(x, size)                  # draw a random sample of lines
  x <- VCorpus(VectorSource(x))         # build a tm corpus from the sample
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, removePunctuation, preserve_intra_word_dashes = TRUE)
  x                                     # return the cleaned corpus
}
blogs_clean   <- CleanUpFunc(blogs)
news_clean    <- CleanUpFunc(news)
twitter_clean <- CleanUpFunc(twitter)
In this step I will try to get a better feeling for the data by looking at two main items: the frequency of words and n-grams in each source, and the coverage, i.e. the number of distinct words needed to account for a given fraction of the text.
The reason I focus on these two items is twofold: first, they may give an idea of the topics behind the sources; second, and most importantly, they are essential for devising a predictive algorithm.
REMARK: I will use the RWeka library.
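As a minimal sketch of how such n-gram statistics could be computed with RWeka (the bigram setting and the names BigramTokenizer, tdm, and freq are illustrative choices, applied here to the cleaned blog sample):

library(RWeka)

# Tokenizer that splits the text into bigrams (min = max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of bigram counts over the cleaned blog sample
tdm  <- TermDocumentMatrix(blogs_clean, control = list(tokenize = BigramTokenizer))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)   # ten most frequent bigrams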
To calculate the coverage, I will find the number of distinct words that are necessary to cover 50% of the text. To cover 90% of the text, on the other hand, the same calculation is repeated with a higher threshold.
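A minimal sketch of the coverage calculation, assuming a vector of word counts sorted in decreasing order (word_freq and coverage are illustrative names; the unigram counts are built here from the cleaned blog sample):

# Unigram counts from the cleaned blog sample, most frequent first
word_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(blogs_clean))),
                  decreasing = TRUE)

coverage <- function(freq, threshold) {
  cum <- cumsum(freq) / sum(freq)   # cumulative share of the text
  which(cum >= threshold)[1]        # distinct words needed to reach it
}

coverage(word_freq, 0.5)   # words covering 50% of the text
coverage(word_freq, 0.9)   # words covering 90% of the text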
Having a way to compute n-gram statistics on the data will definitely help in devising an algorithm to predict the next word in a sentence.
My plan is the following: build frequency tables for unigrams, bigrams, and trigrams from the sampled data, and use them to predict the most likely next word, backing off to lower-order n-grams when no higher-order match is found.
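As an illustration of the idea, not the final model, a minimal sketch of next-word prediction from the bigram table freq built above (predict_next is a hypothetical helper):

# Look up the most frequent bigram starting with the given word and
# return its second word; bigram_freq is assumed sorted in decreasing
# frequency with names of the form "word1 word2".
predict_next <- function(word, bigram_freq) {
  matches <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
  if (length(matches) == 0) return(NA_character_)
  strsplit(names(matches)[1], " ")[[1]][2]
}

predict_next("happy", freq)   # most likely word following "happy"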