In this project, we have analyzed large corpora of text data coming from a number of sources. This first milestone is a report on the analysis of the documents and their vocabulary.
In this exercise we have analyzed the three large corpora provided: twitter, news, and blogs. All have very different characteristics, including their most frequent words.
We have seen that the most frequent words are, not surprisingly, stop words. We removed those from the analysis, but will not remove them from the prediction model.
Further experiments showed that bi-grams seem to be the most prevalent n-grams. We will therefore build an HMM (Hidden Markov Model) prediction model based on the current word, predicting the next one.
Ideally, we could build one model for each of the datasets, as they are very different. However, for the sake of simplicity and coverage, we will build one universal model.
We will download the files from their location, unzip them, and collect basic statistics using the wc command.
# download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip','DSCapstone.zip')
# unzip('DSCapstone.zip')
# Working directory containing the unzipped English corpora
setwd('C:/Users/fragnet/Documents/final/en_US')
# Line and word counts for each file
system("wc -l -w *.txt", intern = TRUE)
## [1] " 899288 37272578 en_US.blogs.txt"
## [2] " 1010242 34309642 en_US.news.txt"
## [3] " 2360148 30341028 en_US.twitter.txt"
## [4] " 4269678 101923248 total"
The twitter dataset has the most entries (over 2.3 million) but contains the fewest words (about 30 million). The blogs and news files are somewhat larger, with longer entries and slightly more words (37 and 34 million respectively) spread over far fewer entries (around 900K blog posts and 1M news items).
Let us first load a number of required libraries, then open a file handle on the US twitter feed and read a sample of it (the file is far too large to process in full).
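The sampling chunk itself is not echoed in this report; a minimal sketch of that step, assuming a 1% random sample and an arbitrary seed (both are choices made here for illustration only), could look like the following.
con <- file("en_US.twitter.txt", "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# Keep roughly 1% of the lines; fraction and seed are arbitrary for this sketch
set.seed(1234)
sample <- twitter[rbinom(length(twitter), 1, 0.01) == 1]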
Let us now use tm and its associated libraries to create a corpus. We will apply some basic pre-processing to this corpus, such as removing numbers, punctuation, and extra whitespace.
library(tm)
library(SnowballC)
library(tau)
# Read the corpus
corpus <- Corpus(VectorSource(sample))
# Clean up using various tm primitives
corpus <- tm_map(corpus, content_transformer(tolower))
# corpus2 <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, stripWhitespace)
For more advanced processing, we will now create a custom content transformer to remove the most common profanity words, based on simple regular expression matching.
# Define a custom content transformer to remove common profanity words
profanity <- content_transformer(function(x) gsub("\\w*(fuck|bitch|shit)\\w*", "", x))
#Use custom content transformer to remove profanity words
corpus<- tm_map(corpus, profanity)
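As a quick illustration of what the regular expression does (the test string below is made up purely for this purpose), any token containing one of the flagged stems is dropped entirely:
# The same regular expression the transformer wraps, applied to a test string
gsub("\\w*(fuck|bitch|shit)\\w*", "", "this is some bullshit example")
Note that the removal can leave stray whitespace behind, so stripWhitespace could be re-applied after this step.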
Using tm, let us now create a Document Term Matrix from the full corpus, then find the 10 most frequent words in the corpus:
dtm <- DocumentTermMatrix(corpus)
# Drop extremely sparse terms to keep the matrix manageable
dtm <- removeSparseTerms(dtm, 0.9999)
matDtm <- as.matrix(dtm)
frequency <- colSums(matDtm)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency, 10)
## the you and for that have with your are this
## 9292 5408 4361 3852 2357 1739 1730 1693 1636 1597
Let us now do the same, this time after removing English stop words:
corpus2 <- tm_map(corpus, removeWords, stopwords("en"))
dtmNoSW <- DocumentTermMatrix(corpus2)
dtmNoSW <- removeSparseTerms(dtmNoSW, 0.9999)
matDtmNoSW <- as.matrix(dtmNoSW)
frequency2 <- colSums(matDtmNoSW)
frequency2 <- sort(frequency2, decreasing = TRUE)
head(frequency2, 10)
## just like get love good will dont day can thanks
## 1557 1207 1070 1047 976 923 904 900 869 860
Clearly, the words remaining after removing English stop words are much more relevant and meaningful for analysis; otherwise stop words dominate the top of the list. Let us print a wordcloud of relevant keywords, as well as a graph of their frequencies.
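The plotting chunk is not echoed here; a sketch along the following lines, assuming an RColorBrewer palette stored in pal (the palette choice is an assumption), would produce the wordcloud that triggers the warning below, together with a simple frequency bar chart.
library(wordcloud)
library(RColorBrewer)
pal <- brewer.pal(8, "Dark2")   # assumed palette, reused for later plots
words <- names(frequency2)
wordcloud(words[1:100], frequency2[1:100], colors = pal)
# Bar chart of the 20 most frequent words (stop words removed)
barplot(frequency2[1:20], las = 2, col = pal[1],
        main = "Most frequent words (stop words removed)")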
## Warning in wordcloud(words[1:100], frequency2[1:100], colors = pal): just
## could not be fit on page. It will not be plotted.
Let us now plot the bigram and trigram distributions. We actually start with a tokenizer that accommodates both bigrams and trigrams.
library(RWeka)
library(tm)
library(wordcloud)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# min = 2, max = 3: this tokenizer captures both bigrams and trigrams
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
tdm <- TermDocumentMatrix(corpus2, control = list(tokenize = trigramTokenizer))
tdm2 <- removeSparseTerms(tdm, 0.9995)
paste("dimensions of the resulting matrix:", dim(tdm2))
## [1] "dimensions of the resulting matrix: 273"
## [2] "dimensions of the resulting matrix: 23601"
matTdm2 <- as.matrix(tdm2)
frequency <- rowSums(matTdm2)
frequency <- sort(frequency, decreasing=TRUE)
words <- names(frequency)
wordcloud(words[1:50], frequency[1:50], colors = pal)
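The n-gram frequency distribution mentioned above can also be shown directly as a bar chart; this is a sketch rather than the original plotting code, and it reuses the pal palette assumed earlier.
# Bar chart of the 20 most frequent bi-/tri-grams
barplot(frequency[1:20], las = 2, cex.names = 0.7, col = pal[2],
        main = "Most frequent bigrams and trigrams")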
We see a few interesting patterns appear here as well. It looks like bi-grams will be a good basis for predicting the next word.
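As a rough illustration of how these counts could feed the planned model (a sketch only, not the final prediction code), the bigrams can be split into a simple "current word, most likely next word" lookup; the query word at the end is just a hypothetical example.
# Split each n-gram into tokens and keep only the true bigrams
parts <- strsplit(names(frequency), " ")
isBigram <- lengths(parts) == 2
current <- sapply(parts[isBigram], `[`, 1)
nextWord <- sapply(parts[isBigram], `[`, 2)
# frequency is sorted in decreasing order, so the first occurrence of each
# current word already carries its most frequent continuation
lookup <- nextWord[!duplicated(current)]
names(lookup) <- current[!duplicated(current)]
lookup["happy"]   # hypothetical query; NA if "happy" never starts a bigram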
We now repeat the same process for the news corpus to generate the corresponding wordcloud. It looks very different.
We repeat the same process for the blog corpus to generate the corresponding wordcloud. It looks very different again. A few strange characters appear; we will attempt to remove those in a future version of the content transformers.
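One possible approach, sketched below under the assumption that simply dropping non-ASCII characters is acceptable, is another content transformer based on iconv; this is not the final implementation.
# Possible future transformer: drop characters that do not convert to ASCII
removeOddChars <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
# corpus <- tm_map(corpus, removeOddChars)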