Loading the Data; Choosing a Sampling Method

The first task is to read the files in. Because of their large size, a 10% random sample of each file is used. A decision was made not to use a stratified random sample, accepting that rare words will be under-represented in the sample drawn for analysis.

# Read the raw blog file and draw a 10% random sample of lines
blog  <- readLines('~/nlp/final/en_US/en_US.blogs.txt')
blogs <- sample(blog, round(length(blog) * 0.10))
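
The Twitter and news files are presumably handled the same way; a minimal sketch, with the file paths and intermediate variable names assumed rather than taken from the report:

# The other two sources get the same 10% random sample (names assumed)
twit    <- readLines('~/nlp/final/en_US/en_US.twitter.txt')
twitss  <- sample(twit, round(length(twit) * 0.10))
newsraw <- readLines('~/nlp/final/en_US/en_US.news.txt')
news    <- sample(newsraw, round(length(newsraw) * 0.10))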

Analysis of the three data sources

> NROW(blogs)
[1] 29002
> NROW(twitss)
[1] 236014
> NROW(news)
[1] 101024

Statistics of the combined corpus

> NROW(all)                                           # lines in the combined sample
[1] 366040
> sum((nchar(all) - nchar(gsub(' ', '', all))) + 1)   # approximate word count (spaces + 1 per line)
[1] 7644999

Lexical Statistics

This involves the study of the distribution of types (words and other units) in texts, first described by Zipf (1949). The R package zipfR, by Stefan Evert and Marco Baroni, provides tools for this kind of analysis. Text corpora differ from other categorical data in the richness of their types.

Basic terminology of zipfR

N: sample/corpus size, the number of tokens in the sample
V: vocabulary size, the number of distinct types in the sample
V_m: type count of spectrum element m, the number of types occurring exactly m times in the sample
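
As a toy illustration of these quantities (example tokens assumed, not taken from the corpus):

# A tiny token vector: N = 7 tokens, V = 4 types, V_1 = 2, V_2 = 1, V_3 = 1
toks    <- c("the", "cat", "sat", "on", "the", "the", "cat")
n.tok   <- length(toks)           # N: number of tokens
v.types <- length(unique(toks))   # V: number of distinct types
v.spec  <- table(table(toks))     # V_m: how many types occur exactly m times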

Zipf's law

Across languages and corpora a simple pattern is observed: there are a few giants (very frequent words) and many dwarves (infrequent words, but far more numerous). The nature of the relationship becomes clear when the log of the frequency is plotted against the log of the rank. Zipf's law states that frequency is roughly inversely proportional to rank: if the most frequent word has frequency 60,000, the second most frequent word will have a frequency of about 30,000 and the third about 20,000.
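
Put differently, frequency falls off roughly as 1/rank; a minimal numeric sketch using the 60,000 figure from the example above:

# Idealised Zipf frequencies: f(r) proportional to 1/r
r <- 1:10
f <- 60000 / r
round(f[1:3])   # 60000 30000 20000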

Lexical analysis of the blog, Twitter and news corpora


Analysis of data

library(zipfR)       # frequency spectra and LNRE models
library(languageR)   # text2spc.fnc()
newsdata <- read.csv('~/nlp/input/news.sample.csv', header = FALSE, stringsAsFactors = FALSE)
news.spc <- text2spc.fnc(newsdata$V1)   # frequency spectrum of the sampled news text
plot(news.spc, log = 'x', main = 'News Frequency Spectrum')

#http://zipfr.r-forge.r-project.org/materials/ESSLLI/04_zipfr.slides.pdf
# Fit a Zipf-Mandelbrot (zm) LNRE model to each frequency spectrum and
# extrapolate the expected spectrum to 10x the observed sample size
blog.zm <- lnre("zm", blog.spc)
summary(blog.zm)
blog.zm.spc <- lnre.spc(blog.zm, 10*N(blog.zm))
twit.zm <- lnre("zm", twit.spc)
summary(twit.zm)
twit.zm.spc <- lnre.spc(twit.zm, 10*N(twit.zm))
news.zm <- lnre("zm", news.spc)
summary(news.zm)
news.zm.spc <- lnre.spc(news.zm, 10*N(news.zm))
plot(blog.zm.spc, twit.zm.spc,news.zm.spc,
     legend=c("blog.zm","twit.zm","news.zm"))
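
A fitted LNRE model can also be used to extrapolate expected vocabulary growth; a minimal sketch using zipfR's EV() and EVm() on the blog model above:

# Expected vocabulary size if the blog sample were doubled, and the expected
# number of hapax legomena (types seen exactly once) at that sample size
EV(blog.zm, 2 * N(blog.zm))
EVm(blog.zm, 1, 2 * N(blog.zm))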

Word Frequency

The corpus was transformed and word frequencies were analyzed. The results for words with a frequency greater than 5000 are presented in the graph below.

library(tm)
# bldata holds the sampled blog text (character vector)
blog.corpus <- Corpus(VectorSource(bldata))
blog.corpus <- tm_map(blog.corpus, content_transformer(tolower))
blog.corpus <- tm_map(blog.corpus, removePunctuation)
blog.corpus <- tm_map(blog.corpus, stripWhitespace)
profanity   <- read.csv('~/nlp/final/profanity.txt', header = FALSE, stringsAsFactors = FALSE)
# rmProfanity and rmSpecialChars are user-defined content transformers
# (definitions not shown) that remove profane words and special characters
blog.corpus <- tm_map(blog.corpus, rmProfanity)
blog.corpus <- tm_map(blog.corpus, rmSpecialChars)
blog.corpus <- tm_map(blog.corpus, content_transformer(stripWhitespace))
tdmblog <- TermDocumentMatrix(blog.corpus, control = list(removePunctuation = TRUE,
                                                          removeNumbers = TRUE,
                                                          stopwords = TRUE))
dtmblog <- DocumentTermMatrix(blog.corpus)
m <- as.matrix(tdmblog)
v <- sort(rowSums(m), decreasing = TRUE)                      # term frequencies from the TDM
N <- 20                                                       # number of top terms to display (assumed value)
head(v, N)
freq <- sort(colSums(as.matrix(dtmblog)), decreasing = TRUE)  # same frequencies from the DTM
head(freq, N)
wfblog <- data.frame(word = names(freq), freq = freq)         # word-frequency table for plotting

Word frequencies from the three sources were then plotted side by side.
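
The combined table wfcombo used below is not assembled in the code shown; a minimal sketch, assuming per-source tables wftwit and wfnews (hypothetical names) built the same way as wfblog:

# Tag each per-source frequency table with its source and stack them
wfblog$source <- 'blog'
wftwit$source <- 'twitter'   # wftwit and wfnews are assumed to exist
wfnews$source <- 'news'
wfcombo <- rbind(wfblog, wftwit, wfnews)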

ggplot(subset(wfcombo, freq > 5000), aes(word, freq, fill = source)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Final corpus and n-grams

As the lexical analysis and the most frequent words were quite similar across the three corpora, it was decided to combine them into a single corpus for analysis.
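
The combination step itself is not shown; a minimal sketch, assuming the sampled Twitter text is held in a character vector (twitdata is a hypothetical name):

library(tm)
# Combine the three sampled text vectors and build a single corpus
all.text   <- c(bldata, twitdata, newsdata$V1)
all.corpus <- Corpus(VectorSource(all.text))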

library(stylo)

A quick and efficient method for making n-grams is provided by the stylo package:

library(stylo)
# txt.to.words() expects plain character input, so the text is extracted
# from the tm corpus and collapsed before tokenising
corpus <- txt.to.words(paste(unlist(lapply(blog.corpus, as.character)), collapse = ' '))

ngrams.1 <- make.ngrams(corpus, ngram.size = 1)
ngrams.2 <- make.ngrams(corpus, ngram.size = 2)
ngrams.3 <- make.ngrams(corpus, ngram.size = 3)

The resulting n-grams and their frequency statistics are presented below.

# Top 20 unigrams: tabulate frequencies and keep the 20 most frequent
ngram.1 <- as.data.frame(table(ngrams.1))
ngram.1 <- ngram.1[order(-ngram.1$Freq), ]
top.20  <- ngram.1[1:20, ]
p1 <- ggplot(top.20, aes(ngrams.1, Freq)) + geom_bar(stat = "identity")
p1 + theme(axis.text.x = element_text(angle = 45, hjust = 1))
p1 + coord_flip()

Bigrams
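
The bigram table and plot follow exactly the same pattern as the unigrams; a minimal sketch (the trigrams in ngrams.3 are handled identically):

library(ggplot2)
# Tabulate bigram frequencies and plot the 20 most frequent
ngram.2   <- as.data.frame(table(ngrams.2))
ngram.2   <- ngram.2[order(-ngram.2$Freq), ]
top.20.bi <- ngram.2[1:20, ]
ggplot(top.20.bi, aes(ngrams.2, Freq)) + geom_bar(stat = "identity") + coord_flip()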

Trigrams

Future Plans

This has been an amazing learning experience thus far, and I am not sure I will be able to pull off the next phase of the course. Initially I had planned to build a separate prediction method for each of the three text sources, but the exploratory analysis shows there is no need for that approach. I would like to keep the predictive method simple. The model will be developed using the 10% random sample; if stratified sampling techniques were used, the rarer words might be over-predicted. I am not yet sure how to implement the predictive model in Shiny.