Data Collection

The data were downloaded from the course website and unzipped to extract the English files as a corpus. Three text files were found, one each for Twitter, blogs and news, with each line representing a single message, post or article. The full dataset contains news, blogs and tweets in four languages: English, German, Russian and Finnish. The English corpus consists of three datasets with the following characteristics. Twitter: short lines; the longest line observed has 140 characters, with roughly 167 million characters and about 31 million words in about 2.4 million tweets. Blogs: paragraphs with multiple sentences per entry; the longest line has 40833 characters, and the dataset holds about 38 million words in just under a million lines. News: paragraphs with multiple sentences; the longest line has 11384 characters, with about 36 million words in roughly 1 million lines.
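
For reference, the download and unzip step can be scripted along the following lines; the archive URL is an assumption (the usual location of the Coursera-SwiftKey dataset) and may differ from the link on the course page.

# assumed dataset location; adjust if the course page links elsewhere
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
        unzip("Coursera-SwiftKey.zip")   # extracts final/en_US, final/de_DE, final/ru_RU, final/fi_FI
}
list.files("final//en_US")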

Data Exploration

library(ggplot2)    # density plots
library(stringr)    # str_count for the rough sentence count
library(tm)         # corpus handling and text cleaning
library(R.utils)    # countLines
library(RTextTools)
library(wordcloud)  # word cloud (loads RColorBrewer for brewer.pal)
filelist <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")
getsummary_file <- function(filename){

        # open a connection and read the whole file into memory
        con <- file(paste("final//en_US//", filename, sep = ""), "r")
        nlines <- countLines(file = paste("final//en_US//", filename, sep = ""))
        text.data <- readLines(con, n = nlines, encoding = "UTF-8", warn = FALSE)
        # characters per line
        nchar.perline <- unlist(lapply(X = text.data, FUN = nchar))
        longest.line <- max(nchar.perline)
        shortest.line <- min(nchar.perline)
        median.line <- median(nchar.perline)
        # rough sentence count: '.' or '?' characters (the '|' is matched literally too)
        n.sentences <- sum(str_count(text.data, "[\\.|?]"))
        # split on non-word characters to get the word list
        word.list <- strsplit(text.data, "\\W+", perl = TRUE)
        words <- unlist(word.list)
        word.count <- length(words)
        unique.words <- length(unique(words))   # computed but not included in the printed summary
        top10.words <- head(sort(table(words), decreasing = TRUE), 10)
        summary.text <- cbind(filename, nlines, longest.line, shortest.line,
                              median.line, n.sentences, word.count)
        print("top 10 words are: ")
        print(top10.words)
        print("file summary")
        print(summary.text)
        # density plot of the characters-per-line distribution
        hi.df <- data.frame(nchar = nchar.perline)
        m <- ggplot(hi.df, aes(x = nchar)) +
                ggtitle(paste("density plot of characters per line in", filename)) +
                geom_density()
        print(m)
        close(con)
        return(text.data)
}

Short summary of the data:

Twitter

tweets <- getsummary_file(filelist[1])
## [1] "top 10 words are: "
## words
##    the      I     to      a    you    and    for     in     of     is 
## 842294 804214 770738 578042 522523 405729 373100 360568 351926 339363 
## [1] "file summary"
##      filename            nlines    longest.line shortest.line median.line
## [1,] "en_US.twitter.txt" "2360148" "140"        "2"           "64"       
##      n.sentences word.count
## [1,] "3020319"   "31150908"

News

news <- getsummary_file(filelist[2])
## [1] "top 10 words are: "
## words
##     the      to     and       a      of      in       s    that     for 
## 1720341  898055  857242  844540  771103  633110  418779  341488  337611 
##      is 
##  281764 
## [1] "file summary"
##      filename         nlines    longest.line shortest.line median.line
## [1,] "en_US.news.txt" "1010242" "11384"      "1"           "185"      
##      n.sentences word.count
## [1,] "2209839"   "35793026"

Blogs

blogs <- getsummary_file(filelist[3])
## [1] "top 10 words are: "
## words
##     the      to     and       I      of       a      in    that      is 
## 1669721 1055462 1036035  889792  868442  865336  555938  459389  426408 
##      it 
##  382723 
## [1] "file summary"
##      filename          nlines   longest.line shortest.line median.line
## [1,] "en_US.blogs.txt" "899288" "40833"      "1"           "156"      
##      n.sentences word.count
## [1,] "2294984"   "38378182"

Tokenization and Exploratory Analysis

The goal of tokenization is to remove meaningless characters and words that occur with very low frequency in the corpus. The final corpus will then contain the words and n-grams with high frequency, which will be helpful for exploring relationships between words and for building a meaningful statistical model (a small bigram sketch follows the word cloud below).

So I cleaned the data by dropping lines that contain non-ASCII characters, converting the text to lower case, removing punctuation, numbers and English stop words, and stemming the remaining words. To reduce the sparsity of the term frequencies, I also remove terms that occur fewer than ten times in the whole corpus; stemming and this frequency cut-off are sketched after the cleaning code below. Let me first process the Twitter data:

# take a random 10% sample of the tweets to keep the analysis manageable
index <- as.logical(rbinom(n = length(tweets), size = 1, prob = 0.10))
tweets <- tweets[index]
# drop lines containing non-ASCII characters (iconv marks them with a placeholder)
nonascii <- grepl("NONASCII", iconv(tweets, "latin1", "ASCII", sub = "NONASCII"))
tweets <- tweets[!nonascii]
# create a corpus and clean it
tweets_corpus <- Corpus(VectorSource(tweets))
sc <- tm_map(tweets_corpus, removeNumbers)
sc <- tm_map(sc, removePunctuation)
sc <- tm_map(sc, content_transformer(tolower))
sc <- tm_map(sc, removeWords, stopwords("english"))
myCorpus <- tm_map(sc, stripWhitespace)
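
The snippet above covers only the basic cleaning. The stemming and the ten-occurrence frequency cut-off described earlier are not shown there, so here is a minimal sketch of how they could be applied; it assumes the SnowballC package (the stemmer backend used by tm's stemDocument) is installed, and uses findFreqTerms on a term-document matrix to apply the cut-off.

# sketch (not the exact code used): stem the cleaned corpus and keep terms occurring >= 10 times
library(SnowballC)                          # provides the Porter stemmer for stemDocument
myCorpus <- tm_map(myCorpus, stemDocument)  # reduce words to their stems
tdm <- TermDocumentMatrix(myCorpus)
frequent.terms <- findFreqTerms(tdm, lowfreq = 10)   # terms with total frequency >= 10
# ten most frequent surviving terms
head(sort(slam::row_sums(tdm)[frequent.terms], decreasing = TRUE), 10)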

Create a word cloud of the most frequent terms in the cleaned sample:

wordcld <- wordcloud(myCorpus,
                     scale = c(5, 0.5),
                     max.words = 200,       # show at most 200 terms
                     random.order = FALSE,
                     rot.per = 0.35,
                     use.r.layout = FALSE,
                     colors = brewer.pal(8, 'Dark2'))

(Figure: word cloud of the 200 most frequent terms in the Twitter sample.)
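
As mentioned in the tokenization section, high-frequency n-grams are what the prediction model will be built on. Below is a minimal sketch of counting bigrams directly on the sampled `tweets` vector with base R; it illustrates the idea rather than the tokenizer that will be used for the final model.

# sketch: bigram counts from the sampled tweets (base R only)
bigrams <- unlist(lapply(strsplit(tolower(tweets), "\\W+", perl = TRUE), function(w) {
        w <- w[w != ""]
        if (length(w) < 2) return(character(0))
        paste(head(w, -1), tail(w, -1))   # join consecutive word pairs
}))
head(sort(table(bigrams), decreasing = TRUE), 10)   # ten most frequent bigrams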

Next step: