The data used in this exploratory analysis can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. In particular, I will focus on the US version of the data, which consists of three files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
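For reference, a minimal sketch of loading the three files, assuming the zip has been downloaded and extracted into the working directory (the final/en_US/ paths are an assumption about the archive layout):

# Assumed paths; adjust if the archive was extracted elsewhere
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)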
The first step in this exploratory analysis is to describe the different data sources, in particular how many lines and words each file contains. This is summarized in the following table:
library(stringi)   # provides stri_count_words()

filesummary <- data.frame(Source = c("Blogs", "News", "Twitter"),
                          Lines  = c(length(blogs), length(news), length(twitter)),
                          Words  = c(sum(stri_count_words(blogs)),
                                     sum(stri_count_words(news)),
                                     sum(stri_count_words(twitter))))
filesummary
##    Source   Lines    Words
## 1   Blogs   899288 38154238
## 2    News    77259  2693898
## 3 Twitter  2360148 30218125
The second step in this exploratory analysis is to clean up and sample the data. The memory required to process the whole data set is significant, so I will work with a random sample of 3000 lines from each source.
library(tm)   # provides VCorpus, tm_map, and the text transformations

set.seed(5)
size <- 3000

CleanUpFunc <- function(x) {
  x <- sample(x, size)                  # draw a random sample of lines
  x <- VCorpus(VectorSource(x))         # build a tm corpus from the sample
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, removePunctuation, preserve_intra_word_dashes = TRUE)
  x                                     # return the cleaned corpus
}
blogs_clean   <- CleanUpFunc(blogs)
news_clean    <- CleanUpFunc(news)
twitter_clean <- CleanUpFunc(twitter)
In this step I will try to get a better feeling for the data by looking at two main items: the frequency of words and n-grams in each source, and the coverage, i.e. the number of distinct words needed to account for a given fraction of the text.
The reason I focus on these two items is twofold: first, they may give an idea of the topics behind the sources; second, and most importantly, they are essential for devising a predictive algorithm.
REMARK: I will use the RWeka library.
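As a minimal sketch of how such n-gram statistics could be computed with RWeka (the bigram setting and the names BigramTokenizer, tdm, and freq are illustrative choices, applied here to the cleaned blog sample):

library(RWeka)

# Tokenizer that splits the text into bigrams (min = max = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of bigram counts over the cleaned blog sample
tdm  <- TermDocumentMatrix(blogs_clean, control = list(tokenize = BigramTokenizer))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)   # ten most frequent bigrams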
To calculate the coverage, I will find the number of distinct words that are necessary to cover 50% of the text. To cover 90% of the text, on the other hand, the same calculation is repeated with a higher threshold.
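A minimal sketch of the coverage calculation, assuming a vector of word counts sorted in decreasing order (word_freq and coverage are illustrative names; the unigram counts are built here from the cleaned blog sample):

# Unigram counts from the cleaned blog sample, most frequent first
word_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(blogs_clean))),
                  decreasing = TRUE)

coverage <- function(freq, threshold) {
  cum <- cumsum(freq) / sum(freq)   # cumulative share of the text
  which(cum >= threshold)[1]        # distinct words needed to reach it
}

coverage(word_freq, 0.5)   # words covering 50% of the text
coverage(word_freq, 0.9)   # words covering 90% of the text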
Having a way to compute n-gram statistics on the data will definitely help in devising an algorithm to predict the next word in a sentence.
My plan is the following: build frequency tables for unigrams, bigrams, and trigrams from the sampled data, and use them to predict the most likely next word, backing off to lower-order n-grams when no higher-order match is found.
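As an illustration of the idea, not the final model, a minimal sketch of next-word prediction from the bigram table freq built above (predict_next is a hypothetical helper):

# Look up the most frequent bigram starting with the given word and
# return its second word; bigram_freq is assumed sorted in decreasing
# frequency with names of the form "word1 word2".
predict_next <- function(word, bigram_freq) {
  matches <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
  if (length(matches) == 0) return(NA_character_)
  strsplit(names(matches)[1], " ")[[1]][2]
}

predict_next("happy", freq)   # most likely word following "happy"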