The first task is to read the files in. Because the files are large, a 10% random sample of each was drawn for analysis. A decision was made not to use a stratified random sample, accepting that rare words would be under-represented.
blog<-readLines('~/nlp/final/en_US/en_US.blogs.txt')
blogs<-blog[sample(length(blog), round(length(blog)*0.10))] # draw a 10% simple random sample of lines
NROW(blogs)
[1] 29002
NROW(twitss)
[1] 236014
NROW(news)
[1] 101024
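The three samples were then combined into a single vector of lines for an overall count. The combination step is not shown; given that the total line count below equals the sum of the three samples, it is presumably a simple concatenation:
all<-c(blogs,twitss,news) # combine the three 10% samples into one vector of lines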
NROW(all);sum((nchar(all) - nchar(gsub(' ','',all))) + 1) # total lines and approximate word count (spaces per line + 1)
[1] 366040
[1] 7644999
Lexical statistics is the study of the distribution of types (words and other units) in texts, first systematically described by Zipf (1949). The R package zipfR, by Stefan Evert and Marco Baroni, implements methods for this kind of analysis. Text corpora differ from other categorical data in the richness of their types.
N: sample/corpus size, the number of tokens in the sample
V: vocabulary size, the number of distinct types in the sample
V_m: type count of spectrum element m, the number of types in the sample with token frequency m
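As a toy illustration of these quantities (not drawn from the project data):
toy<-c('the','cat','sat','on','the','mat')
length(toy)         # N = 6 tokens
length(unique(toy)) # V = 5 distinct types
table(table(toy))   # spectrum: V_1 = 4 types occur once, V_2 = 1 type occurs twice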
Across languages and corpora a simple pattern is observed: there are a few giants (very frequent words) and many dwarves (less frequent but far more numerous words). The nature of the relationship becomes clear when the log of the frequency is plotted against the log of the rank. Zipf's law predicts that frequency is roughly inversely proportional to rank: if the most frequent word has frequency 60,000, the second most frequent word has frequency about 30,000 and the third about 20,000.
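The idealised relationship can be sketched directly (illustrative values only, not fitted to the corpus):
zipf.rank<-1:1000
zipf.freq<-60000/zipf.rank # f(1)=60000, f(2)=30000, f(3)=20000, ...
plot(log(zipf.rank),log(zipf.freq),type='l',xlab='log(rank)',ylab='log(frequency)',main='Idealised Zipf distribution')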
library(tm)      # text mining framework (loads NLP as a dependency)
library(ggplot2)
library(zipfR)     # frequency spectra and LNRE models
library(languageR) # provides text2spc.fnc
newsdata<-read.csv('~/nlp/input/news.sample.csv',header=FALSE,stringsAsFactors=FALSE)
news.spc<-text2spc.fnc(newsdata$V1) # build the frequency spectrum from the text column
plot(news.spc,log='x',main='News Frequency Spectrum')
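The blog and Twitter frequency spectra used below (blog.spc, twit.spc) are not built in the code shown; presumably they were created from the corresponding samples in the same way as news.spc, for example:
blog.spc<-text2spc.fnc(blogdata$V1) # blogdata: hypothetical blog sample read in like newsdata
twit.spc<-text2spc.fnc(twitdata$V1) # twitdata: likewise hypothetical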
#http://zipfr.r-forge.r-project.org/materials/ESSLLI/04_zipfr.slides.pdf
blog.zm <- lnre("zm", blog.spc)                 # fit a Zipf-Mandelbrot LNRE model to the blog frequency spectrum
summary(blog.zm)
blog.zm.spc <- lnre.spc(blog.zm, 10*N(blog.zm)) # expected spectrum extrapolated to 10 times the observed sample size
twit.zm <- lnre("zm", twit.spc)
summary(twit.zm)
twit.zm.spc <- lnre.spc(twit.zm, 10*N(twit.zm))
news.zm <- lnre("zm", news.spc)
summary(news.zm)
news.zm.spc <- lnre.spc(news.zm, 10*N(news.zm))
plot(blog.zm.spc, twit.zm.spc,news.zm.spc,
legend=c("blog.zm","twit.zm","news.zm"))
The corpus was transformed and word frequencies analyzed. The results for words with a frequency greater than 5,000 are presented in the graph below.
blog.corpus<-Corpus(VectorSource(bldata)) # bldata is assumed to hold the sampled blog text
blog.corpus<-tm_map(blog.corpus,content_transformer(tolower)) # wrap base R functions in content_transformer so documents keep their PlainTextDocument class
blog.corpus<-tm_map(blog.corpus,removePunctuation)
blog.corpus<-tm_map(blog.corpus,stripWhitespace)
profanity <- read.csv('~/nlp/final/profanity.txt',header = FALSE,stringsAsFactors = FALSE)
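# rmProfanity and rmSpecialChars are custom transformations whose definitions are not
# shown in the report; the lines below are minimal sketches of what they might look like.
rmProfanity<-content_transformer(function(x) removeWords(x, profanity$V1))              # drop words on the profanity list
rmSpecialChars<-content_transformer(function(x) gsub("[^[:alnum:][:space:]']", " ", x)) # replace other special characters with spaces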
blog.corpus<-tm_map(blog.corpus,rmProfanity)
blog.corpus<-tm_map(blog.corpus,rmSpecialChars)
blog.corpus <- tm_map(blog.corpus,content_transformer(stripWhitespace))
tdmblog <- TermDocumentMatrix(blog.corpus, control = list(removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE))
dtmblog <- DocumentTermMatrix(blog.corpus)
m <- as.matrix(tdmblog)
v <- sort(rowSums(m), decreasing=TRUE)                     # term frequencies from the filtered term-document matrix
head(v, N)                                                 # N is assumed to be the number of top terms to inspect
freq <- sort(colSums(as.matrix(dtmblog)), decreasing=TRUE) # frequencies from the unfiltered document-term matrix
head(freq, N)
wfblog <- data.frame(word=names(freq), freq=freq)          # word-frequency table used for plotting
Word frequencies from the three sources were plotted side by side.
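The combined table wfcombo is not built in the code shown; a minimal sketch, assuming wftwit and wfnews were produced from the Twitter and news corpora in the same way as wfblog:
wfblog$source<-'blog'; wftwit$source<-'twitter'; wfnews$source<-'news'
wfcombo<-rbind(wfblog,wftwit,wfnews) # stack the per-source frequency tables for plotting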
ggplot(subset(wfcombo,freq>5000), aes(word,freq,fill=source)) + geom_bar(stat='identity',position='dodge') +theme(axis.text.x=element_text(angle=45, hjust=1))
As the lexical analysis and the most frequent words were quite similar across the three corpora, it was decided to combine them into a single corpus for further analysis.
The stylo package provides a quick and efficient method for making n-grams.
library(stylo)
corpus <- txt.to.words(unlist(lapply(blog.corpus, as.character))) # convert the tm corpus to a plain character vector before tokenizing
ngrams.1<-make.ngrams(corpus, ngram.size = 1)
ngrams.2<-make.ngrams(corpus, ngram.size = 2)
ngrams.3<-make.ngrams(corpus, ngram.size = 3)
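# The tabulation of the n-gram vectors into sorted frequency tables is not shown in
# the original; a minimal sketch for the unigrams (bigrams and trigrams are analogous):
freq.1<-table(ngrams.1)
ngram.1<-data.frame(ngrams.1=names(freq.1), Freq=as.integer(freq.1), stringsAsFactors=FALSE)
ngram.1<-ngram.1[order(ngram.1$Freq, decreasing=TRUE),] # most frequent unigrams first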
# Top 20 individual words (unigrams)
top.20<-ngram.1[1:20,]
p1<-ggplot(top.20,aes(ngrams.1,Freq)) # refer to columns directly rather than with $ inside aes()
p1<-p1 + geom_bar(stat="identity")
p1 + theme(axis.text.x=element_text(angle=45, hjust=1))
p1 + coord_flip() # alternative view with horizontal bars
## stylo version: 0.5.9
Bigrams
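The bigram frequencies can be plotted in the same way as the unigrams; a brief sketch, assuming ngram.2 has been tabulated like ngram.1 above:
top.20.bi<-ngram.2[1:20,]
ggplot(top.20.bi,aes(ngrams.2,Freq)) + geom_bar(stat="identity") + coord_flip()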
Trigrams
This has been an amazing learning experience thus far. I am not sure I will be able to pull off the next phase of the course. Initially I had planned on having a separate prediction method for each of the three sources of text, but after the exploratory analysis it is evident that this approach is not needed. I would like to keep the predictive method simple. The model will be developed using the 10% random sample; if stratified sampling techniques were used, the rarer words might be over-predicted. I am not sure how to implement the predictive model in Shiny.