This milestone report explores natural language processing. It covers loading and scanning the data, drawing a sample, exploring the sampled text, and summarizing the findings.
The Twitter, blog and news texts were downloaded and read. Basic summary statistics (the number of lines, words and characters in each text) are listed below:
| | Twitter.Text | Blogs.Text | News.Text |
|---|---|---|---|
| Total Lines | 2360148 | 899288 | 77259 |
| Total Words | 37546239 | 2674536 | 30093372 |
| Total Characters | 206824505 | 15639408 | 162096031 |
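Counts of this kind can be obtained with base R alone. The sketch below is an illustration, not the code used for the table: `text_summary` is a hypothetical helper, the three short vectors are toy stand-ins for the real files, and a "word" is assumed to be a whitespace-separated token, which may differ from the definition behind the numbers above.

```r
# Hypothetical helper: summarise a character vector holding one line per element.
text_summary <- function(x) {
  # Assumption: a "word" is any run of non-whitespace characters.
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  c(lines      = length(x),
    words      = sum(lengths(tokens)),
    characters = sum(nchar(x)))
}

# Toy stand-ins for the twitter, blogs and news vectors.
twitter_toy <- c("I'm saying a prayer for them", "so happy today")
blogs_toy   <- c("I like to read realistic novels")
news_toy    <- c("Pressure is especially high", "Rates may rise", "Markets fell")

# One column per source, one row per statistic.
summary_tab <- sapply(
  list(Twitter.Text = twitter_toy, Blogs.Text = blogs_toy, News.Text = news_toy),
  text_summary
)
print(summary_tab)
```

Applied to the full vectors, the same call produces a table with the layout shown above.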
Because the Twitter, blog and news texts are too large to process in full, a 1% random sample of each was drawn and combined to serve as training data representing the whole text. The first six entries of the sample are shown below.
set.seed(1)
sample_mix <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
head(sample_mix)
## [1] "NOW JUL 3 - 9, 2003 VOL. 22 NO. 44"
## [2] "Pressure is especially high to develop new government revenue and legislators have indicated that removing tax advantages is more palatable than raising rates. But beyond that, why is the insurance tax status more vulnerable in 2012 than any other year?"
## [3] "I will be giving away a $15 giftcard to your favorite craft store or online store!"
## [4] "I'm saying a prayer for them"
## [5] "I like to read realistic novels with challenging themes so inevitably my Y.A. novels deal with quite strong issues. Hidden (March 2011) focused on asylum seekers, racist bullying and human rights. Illegal (March 2012) picks up the character of Lindy who first appears in Hidden, the bad girl with a nail sharpened to a spear and tells her story. I had become interested in the rising number of cannabis farms being discovered and raided by the police along the south coast. What if someone set up a cannabis farm on little Hayling Island and asked Lindy to run it?"
## [6] "I think setting it on Hayling Island is inspired as it seems to serve as a microcosm of a country's prejudices and tolerances but it also gives it a sense of claustrophobia - this is a place where everyone knows everyone else. But, as Alix discovers, we may not always know people as well as we think and we can all be guilty of making judgements and jumping to conclusions about others. This is a story of struggle and of hope, of prejudice and of tolerance,and of understanding that working out what's right or wrong isn't always black and white."
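The sampling step can also be wrapped in a small function so that the fraction is explicit and the sample size is rounded down to a whole number. This is only a sketch: `sample_fraction` is a hypothetical helper that is not part of the original analysis, and the toy vector stands in for the real text.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement.
sample_fraction <- function(x, frac) {
  n <- floor(length(x) * frac)   # whole number of lines to keep
  sample(x, n)
}

set.seed(1)                          # make the draw reproducible
toy_lines  <- paste("line", 1:1000)  # stand-in for blogs, news or twitter
toy_sample <- sample_fraction(toy_lines, 0.01)

length(toy_sample)                   # 10 lines, i.e. 1% of 1000
```

Resetting the seed before the call returns the same lines again, which keeps the later frequency tables reproducible.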
The tm package was used to convert the sample into a corpus, from which the frequencies of the most common words and phrases were computed.
library(NLP)
library(tm)
library(RWeka)
library(ggplot2)
library(forcats)
corpus <- VCorpus(VectorSource(sample_mix))
# Helper that replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove URLs and Twitter handles, matching only up to the next whitespace
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+")
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
# Wrap tolower in content_transformer() so each document keeps its class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpusdf<-data.frame(text=unlist(sapply(corpus,'[',"content")),stringsAsFactors = FALSE)
head(corpusdf)
## text
## 1 now jul vol
## 2 pressure especially high develop new government revenue legislators indicated removing tax advantages palatable raising rates beyond insurance tax status vulnerable year
## 3 will giving away giftcard favorite craft store online store
## 4 im saying prayer
## 5 like read realistic novels challenging themes inevitably ya novels deal quite strong issues hidden march focused asylum seekers racist bullying human rights illegal march picks character lindy first appears hidden bad girl nail sharpened spear tells story become interested rising number cannabis farms discovered raided police along south coast someone set cannabis farm little hayling island asked lindy run
## 6 think setting hayling island inspired seems serve microcosm countrys prejudices tolerances also gives sense claustrophobia place everyone knows everyone else alix discovers may always know people well think can guilty making judgements jumping conclusions others story struggle hope prejudice toleranceand understanding working whats right wrong isnt always black white
As this glimpse of the corpus shows, the sentences have been reduced to lower-case words, with numbers, punctuation and stop words removed.
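The effect of the cleaning steps is easier to see on a single sentence. The sketch below is a base-R stand-in, not the tm pipeline itself: `clean_text` and `toy_stop_words` are hypothetical names, the stop-word list is a tiny illustrative one rather than `stopwords("english")`, and the function handles one string at a time.

```r
# Hypothetical base-R stand-in for the cleaning steps, applied in the same order.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                              # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)             # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)             # drop numbers
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  tokens <- tokens[!tokens %in% stop_words]    # drop stop words
  paste(tokens, collapse = " ")
}

# Assumption: a tiny illustrative stop-word list, not tm's stopwords("english").
toy_stop_words <- c("a", "for", "them", "to", "the")

cleaned <- clean_text("I'm saying a prayer for them", toy_stop_words)
print(cleaned)   # "im saying prayer"
```

Because punctuation is stripped before stop words are filtered, a contraction such as "I'm" becomes "im" and no longer matches a stop-word entry spelled with an apostrophe. That is consistent with tokens such as `im`, `whats` and `isnt` surviving in the output above.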
# Tokenizer that emits single words (n-grams with n = 1)
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigramtab <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
# Keep terms occurring at least 1000 times, then sum their counts over all documents
unigramcorpus <- findFreqTerms(unigramtab, lowfreq = 1000)
unigramcorpusnum <- rowSums(as.matrix(unigramtab[unigramcorpus, ]))
unigramcorpustab <- data.frame(Word = names(unigramcorpusnum), frequency = unigramcorpusnum)
# Sort by descending frequency and plot the ten most frequent words
unigramcorpustab <- unigramcorpustab[order(-unigramcorpustab$frequency), ]
unigramcorpustab <- unigramcorpustab[1:10, ]
g_unigram <- ggplot(unigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_unigram + geom_bar(stat = "identity") + coord_flip() + labs(x = "Top 10 Popular words")
# Same steps for two-word phrases, with a lower frequency threshold
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigramtab <- TermDocumentMatrix(corpus,control=list(tokenize=bigram))
bigramcorpus <- findFreqTerms(bigramtab,lowfreq=80)
bigramcorpusnum <- rowSums(as.matrix(bigramtab[bigramcorpus,]))
bigramcorpustab <- data.frame(Word=names(bigramcorpusnum),frequency=bigramcorpusnum)
bigramcorpustab <- bigramcorpustab[order(-bigramcorpustab$frequency),]
bigramcorpustab <- bigramcorpustab[1:10, ]
g_bigram <- ggplot(bigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_bigram + geom_bar(stat="identity") + coord_flip() + labs(x = "Top 10 Popular phrases of two words")
# Same steps for three-word phrases, with a lower threshold again
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigramtab <- TermDocumentMatrix(corpus,control=list(tokenize=trigram))
trigramcorpus <- findFreqTerms(trigramtab,lowfreq=10)
trigramcorpusnum <- rowSums(as.matrix(trigramtab[trigramcorpus,]))
trigramcorpustab <- data.frame(Word=names(trigramcorpusnum),frequency=trigramcorpusnum)
trigramcorpustab <- trigramcorpustab[order(-trigramcorpustab$frequency),]
trigramcorpustab <- trigramcorpustab[1:10, ]
g_trigram <- ggplot(trigramcorpustab, aes(fct_reorder(Word, frequency), frequency))
g_trigram + geom_bar(stat="identity") + coord_flip() + labs(x = "Top 10 Popular phrases of three words")
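To make the n-gram counting concrete, the sketch below builds and counts n-grams in base R with a sliding window. It is an illustration under assumptions, not RWeka's implementation: `make_ngrams`, `count_ngrams` and `toy_docs` are hypothetical names, and the toy documents are assumed to be already cleaned and separated by single spaces.

```r
# Hypothetical helper: all n-grams of a token vector, taken with a sliding window.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: count n-grams over a vector of cleaned documents.
count_ngrams <- function(docs, n) {
  tokens <- strsplit(docs, " ", fixed = TRUE)
  grams  <- unlist(lapply(tokens, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

# Toy cleaned documents (assumption: lower-cased, no punctuation).
toy_docs <- c("yes yes yes yes please", "thanks so much", "thanks so much everyone")

trigram_counts <- count_ngrams(toy_docs, 3)
print(trigram_counts)
```

Because the windows overlap, a single line that repeats one word contributes the same trigram more than once: here "yes yes yes" is counted twice from one document. In a small sample, a few such lines may be enough to push a repetitive phrase into the top ranks.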
The top 10 words and the top 10 two-word phrases are plausible. The top 10 three-word phrases, however, unexpectedly include some odd entries, such as “yes yes yes” and “item c pp”. These anomalies are probably artifacts of the random sample.
In summary, the Twitter, blog and news texts were sampled and combined into a corpus for exploring natural language processing. The most frequent words and phrases were identified and visualized. Further work will focus on building a model, evaluating it and visualizing the results.