Capstone Week 2 Milestone Report

This milestone report is an exploratory look at natural language processing. It covers loading and scanning the data, selecting a sample, exploring the sample, and summarizing the findings.

1. Loading and Scanning Data

The Twitter, blog and news texts were downloaded and read. The number of lines, words and characters in each text is listed below:

                   Twitter.Text   Blogs.Text   News.Text
Total Lines             2360148       899288       77259
Total Words            30093372     37546239     2674536
Total Characters      162096031    206824505    15639408
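
Counts like the ones above could be produced along the following lines. This is a minimal sketch in base R, not necessarily how the report computed them: `text_summary` is a hypothetical helper, the three short `*_demo` vectors are stand-ins for the real files, and words are counted by splitting on whitespace, which may differ slightly from other word counters.

```r
# Hypothetical helper: summarise a character vector with one element per line.
text_summary <- function(lines) {
  # Split each line on runs of whitespace and drop empty tokens
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  c(Lines = length(lines),
    Words = length(tokens),
    Characters = sum(nchar(lines)))
}

# Toy stand-ins for the real data (assumption: one element per line of text)
twitter_demo <- c("good morning", "see you soon")
blogs_demo   <- c("a short blog line")
news_demo    <- c("markets rose today", "rain is expected")

# One column per source, one row per statistic
summary_tab <- sapply(list(Twitter = twitter_demo,
                           Blogs   = blogs_demo,
                           News    = news_demo),
                      text_summary)
print(summary_tab)
```

Applied to the full character vectors instead of the toy ones, the same call would give a table with the layout shown above.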

2. Selecting sample data

Because the Twitter, blog and news texts are too large to process in full, a 1% random sample of each was drawn as training data to represent the whole text. The first 6 lines of the sample are shown below.

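# Fix the random seed so the same sample is drawn on every run
set.seed(1)
# Draw 1% of the lines from each source and combine them into one vector
sample_mix <- c(sample(blogs, length(blogs) * 0.01),
            sample(news, length(news) * 0.01),
            sample(twitter, length(twitter) * 0.01))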

head(sample_mix)
## [1] "NOW JUL 3 - 9, 2003 VOL. 22 NO. 44"
## [2] "Pressure is especially high to develop new government revenue and legislators have indicated that removing tax advantages is more palatable than raising rates. But beyond that, why is the insurance tax status more vulnerable in 2012 than any other year?"
## [3] "I will be giving away a $15 giftcard to your favorite craft store or online store!"
## [4] "I'm saying a prayer for them"
## [5] "I like to read realistic novels with challenging themes so inevitably my Y.A. novels deal with quite strong issues. Hidden (March 2011) focused on asylum seekers, racist bullying and human rights. Illegal (March 2012) picks up the character of Lindy who first appears in Hidden, the bad girl with a nail sharpened to a spear and tells her story. I had become interested in the rising number of cannabis farms being discovered and raided by the police along the south coast. What if someone set up a cannabis farm on little Hayling Island and asked Lindy to run it?"
## [6] "I think setting it on Hayling Island is inspired as it seems to serve as a microcosm of a country's prejudices and tolerances but it also gives it a sense of claustrophobia - this is a place where everyone knows everyone else. But, as Alix discovers, we may not always know people as well as we think and we can all be guilty of making judgements and jumping to conclusions about others. This is a story of struggle and of hope, of prejudice and of tolerance,and of understanding that working out what's right or wrong isn't always black and white."

3. Exploring the sample data

The tm package was used to convert the sample data into a corpus and to find the most frequent words and phrases.

library(NLP)
library(tm)
library(RWeka)
library(ggplot2)
library(forcats)
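# Build a corpus with one document per sampled line
corpus <- VCorpus(VectorSource(sample_mix))
# Helper transformation: replace every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove URLs, then Twitter handles ("@" followed by non-whitespace characters)
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
# Wrap tolower in content_transformer() so the documents keep their class,
# which makes a later conversion back to PlainTextDocument unnecessary
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))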

corpusdf<-data.frame(text=unlist(sapply(corpus,'[',"content")),stringsAsFactors = FALSE)
head(corpusdf)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                              text
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                   now jul vol  
## 2                                                                                                                                                                                                                                                                     pressure  especially high  develop new government revenue  legislators  indicated  removing tax advantages   palatable  raising rates  beyond     insurance tax status  vulnerable     year
## 3                                                                                                                                                                                                                                                                                                                                                                                                will  giving away  giftcard   favorite craft store  online store
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                             im saying  prayer  
## 5  like  read realistic novels  challenging themes  inevitably  ya novels deal  quite strong issues hidden march focused  asylum seekers racist bullying  human rights illegal march picks   character  lindy  first appears  hidden  bad girl   nail sharpened   spear  tells  story   become interested   rising number  cannabis farms  discovered  raided   police along  south coast   someone set   cannabis farm  little hayling island  asked lindy  run 
## 6                              think setting   hayling island  inspired   seems  serve   microcosm   countrys prejudices  tolerances   also gives   sense  claustrophobia    place  everyone knows everyone else   alix discovers  may  always know people  well   think   can   guilty  making judgements  jumping  conclusions  others    story  struggle   hope  prejudice   toleranceand  understanding  working  whats right  wrong isnt always black  white

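This glimpse of the corpus shows that the cleaned text is lowercase and free of numbers, punctuation and most stopwords. Extra spaces remain where stopwords were removed after whitespace had been stripped, and contractions such as “im” and “isnt” survive the stopword filter.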
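
One likely reason that tokens such as "im" and "isnt" remain in the output is the order of the cleaning steps: the English stopword list contains contractions written with apostrophes, so they can no longer match once punctuation has been removed. The sketch below illustrates the effect in base R. It is not the tm implementation: `stop_demo` is a small hand-written list standing in for `stopwords("english")`, and `drop_words` and `squish` are hypothetical helpers.

```r
# Assumed mini stopword list; the real list in tm is much longer
stop_demo <- c("i'm", "a", "for", "them")

# Hypothetical helper: remove whole-word matches of the given words
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical helper: collapse repeated spaces and trim both ends
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

line <- "i'm saying a prayer for them"

# Order used in the report: punctuation first, then stopwords
punct_first <- squish(drop_words(gsub("[[:punct:]]", "", line), stop_demo))

# Alternative order: stopwords first, then punctuation
words_first <- squish(gsub("[[:punct:]]", "", drop_words(line, stop_demo)))

print(punct_first)
print(words_first)
```

With punctuation removed first the result keeps "im", matching the fourth line of the output above; with stopwords removed first the contraction is dropped. Whether to reorder the real pipeline is a design choice, since it would change the n-gram counts reported below.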

# Tokenizer that emits single words (1-grams)
unigram <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
unigramtab <- TermDocumentMatrix(corpus,control=list(tokenize=unigram))
# Keep terms that occur at least 1000 times and total them across documents
unigramcorpus <- findFreqTerms(unigramtab,lowfreq=1000)
unigramcorpusnum <- rowSums(as.matrix(unigramtab[unigramcorpus,]))
unigramcorpustab <- data.frame(Word=names(unigramcorpusnum),frequency=unigramcorpusnum)
# Sort by descending frequency and keep the ten most frequent terms
unigramcorpustab <- unigramcorpustab[order(-unigramcorpustab$frequency),]
unigramcorpustab <- unigramcorpustab[1:10, ]

# Horizontal bar chart with the most frequent term at the top
g_unigram <- ggplot(unigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_unigram + geom_bar(stat="identity") + coord_flip() + labs(x = "Top 10 Popular words")

# Same steps for two-word phrases (2-grams), with a lower frequency cutoff
bigram <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
bigramtab <- TermDocumentMatrix(corpus,control=list(tokenize=bigram))
bigramcorpus <- findFreqTerms(bigramtab,lowfreq=80)
bigramcorpusnum <- rowSums(as.matrix(bigramtab[bigramcorpus,]))
bigramcorpustab <- data.frame(Word=names(bigramcorpusnum),frequency=bigramcorpusnum)
bigramcorpustab <- bigramcorpustab[order(-bigramcorpustab$frequency),]
bigramcorpustab <- bigramcorpustab[1:10, ]
g_bigram <- ggplot(bigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_bigram + geom_bar(stat="identity") + coord_flip() + labs(x = "Top 10 Popular phrases of two words")

# Same steps for three-word phrases (3-grams), with a cutoff of 10 occurrences
trigram <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
trigramtab <- TermDocumentMatrix(corpus,control=list(tokenize=trigram))
trigramcorpus <- findFreqTerms(trigramtab,lowfreq=10)
trigramcorpusnum <- rowSums(as.matrix(trigramtab[trigramcorpus,]))
trigramcorpustab <- data.frame(Word=names(trigramcorpusnum),frequency=trigramcorpusnum)
trigramcorpustab <- trigramcorpustab[order(-trigramcorpustab$frequency),]
trigramcorpustab <- trigramcorpustab[1:10, ]
g_trigram <- ggplot(trigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_trigram + geom_bar(stat="identity") + coord_flip() + labs(x = "Top 10 Popular phrases of three words")

The top 10 words and the top 10 two-word phrases make sense. However, the top 10 three-word phrases unexpectedly include some strange entries, such as “yes yes yes” and “item c pp”. These anomalies are probably artifacts of the sample selection.
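
One way to check whether an odd trigram comes from a few repetitive lines rather than from general usage is to compare its total count with the number of distinct lines it appears in. The sketch below does this in base R as a stand-in for the RWeka tokenizer. The helper `ngrams_of` is hypothetical and the lines in `demo_lines` are made up, so it only illustrates the mechanism and says nothing about the actual sample.

```r
# Hypothetical helper: all n-grams of one line, split on whitespace
ngrams_of <- function(line, n) {
  words <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up cleaned lines (assumption, not taken from the corpus)
demo_lines <- c("yes yes yes yes yes yes",
                "happy new year everyone",
                "wishing you happy new year")

# Total count of each trigram across all lines
all_trigrams <- unlist(lapply(demo_lines, ngrams_of, n = 3))
trigram_freq <- sort(table(all_trigrams), decreasing = TRUE)

# Number of distinct lines that contain each trigram
line_count <- table(unlist(lapply(demo_lines,
                                  function(l) unique(ngrams_of(l, 3)))))

print(trigram_freq)
print(line_count[names(trigram_freq)])
```

In this toy data "yes yes yes" has the highest total count but occurs in a single line, while "happy new year" occurs in two different lines. A large gap between the two numbers in the real sample would support the explanation that a trigram is a sampling artifact.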

4. Summary

The Twitter, blog and news texts were combined into a corpus to explore natural language processing. The most frequent words and phrases were identified and visualized. Further work will focus on building a model, evaluating it and visualizing the results.