Author: Jacob Govshteyn Date: Saturday, July 25, 2015
synopsis
Through exploratory data analysis a thorough understanding of the training data set will aid in the development of an accurate next-word perdition model. A deep dive of the of the three social media data sets should shed light on possible techniques needed to sample, mine, and parse data from each set.
Due to memory limitation, a 1% sample of each source will be used for analysis.
Tokenization
From the raw pool of text, each document is divided into a sequence of tokens, which roughly correspond to "words". In order to properly index each token, several normalization tasks are applied to the set of tokens.
Obviously each entry in the twitter doc is a single tweet, but it's not clear how the blog and news entries are derived. It's unclear whether blog and news entries are user comments, random sentences, or whole entries. A crude, yet effective method to query the entries would be to simply Google some random entries from each set.
From examination, each blog and news entry is segmented by newlines(). This isn't a line in a traditional sense, from beginning to end of the page, but roughly equals a paragraph. Since tweets are limited to 144 characters, this helps explain the difference in mean words per Line between tweets and news/blog entries.
Analysis of Vocabulary
In order to build an accurate model, a well-represented slice of the population must be used in the compilation of training and test data. Analysis of the three documents should reveal localized grammar bias specific to any individual sources, and correlation between any two sources.
corpusClean <- function(file.Corpus) {
#remove select punctuation
removePunctuation<- function(x) {
x<- gsub("[^[:alnum:][:space:]\'-]", "", x)
gsub("(\\w[\'-]\\w)|[[:punct:]]", "\\1", x)
}
#fix apostrophe Gremlins.
fixGremlins <- function(x) {
gsub("[\u0091\u0092]", "'", x)
}
#repeat words
removeRepeat <- function(x) {
gsub("[[:space:]](\\w+)[[:space:]]\\1", " \\1", x)
}
#url
removeUrl <- function(x) {
gsub("http\\w+|www\\w+", "", x)
}
#html\xml
removeMarkup <- function(x) {
gsub("<\\w+>","",x)
}
#twitter meta data removal
removeMeta <- function(x) {
#@UserName
x <- gsub("@\\w+", "", x)
#hashtag
x <- gsub("#\\w+", "", x)
#retweets\via
gsub("^(RT|via)|(RT :)", "", x)
}
removeNumeric <- function(x) {
gsub("(\\w*[0-9]+\\w*)", "", x)
}
file.Corpus <- tm_map(file.Corpus, content_transformer(removeNumeric))
file.Corpus <- tm_map(file.Corpus, content_transformer(removeMeta))
file.Corpus <- tm_map(file.Corpus, content_transformer(removeMarkup))
file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))
file.Corpus <- tm_map(file.Corpus, content_transformer(fixGremlins))
file.Corpus <- tm_map(file.Corpus, content_transformer(removeUrl))
file.Corpus <- tm_map(file.Corpus, content_transformer(tolower))
file.Corpus <- tm_map(file.Corpus, removeWords, badWorder)
file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))
}
cleanFreq<-function(object.sample,n=1){
# special token for apostrophe
APS <- "__ap__"
getGrams <- function(x) NGramTokenizer(x, Weka_control(min=n,max=n))
subtApo <- function(x) {
gsub("'", APS, x)
}
resApo <- function(x){
string <- gsub(APS, "'", x)
}
#Creat and clean corpus object from sample
object.corpus <- corpusClean(VCorpus(VectorSource(object.sample)))
object.corpus<- tm_map(object.corpus, content_transformer(subtApo))
object.tdm <- TermDocumentMatrix(object.corpus,
control = list(stopwords=FALSE,
wordLengths = c(1,100),
tokenize=getGrams))
m <- as.matrix(object.tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
# create a data table with words and their frequencies
dt.object= data.table(word=resApo(names(word_freqs)), freq=word_freqs)
return(dt.object)
}
#Creat clean frequencies object from sample
dt.freq.tweets<-cleanFreq(sample.tweets)
dt.freq.tweets[,freq:=dt.freq.tweets$freq/sum(dt.freq.tweets$freq)]
dt.freq.news<-cleanFreq(sample.news)
dt.freq.news[,freq:=dt.freq.news$freq/sum(dt.freq.news$freq)]
dt.freq.blogs<-cleanFreq(sample.blogs)
dt.freq.blogs[,freq:=dt.freq.blogs$freq/sum(dt.freq.blogs$freq)]
dt.freq.merged=merge(merge(dt.freq.tweets,dt.freq.blogs,by="word",all=TRUE),
dt.freq.news,by="word",all=TRUE)
for (j in names(dt.freq.merged))
set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
wordThresh=.005
setkey(dt.freq.merged,tweets,blogs,news)
op <- options(gvis.plot.tag = "chart")
Bar <- gvisBarChart(dt.freq.merged[tweets>wordThresh|blogs>wordThresh|news>wordThresh],
options=list(height=900, width=900,title='Top Word Frequency in Each Document'))
plot(Bar)
From the word frequency graph it's clear that vocabulary differs between the text sources.
One obvious difference is the Narrative point-of-view.
Tweets tend to use first and second person perspective.
I
me
you
News articles are almost exclusively told from a third person perspective.
he
his
Blogs have a mixture of third and first person Narrative.
she
them
we
There's also a difference in gender specific terminology. News articles are more masculine oriented, with high frequency of terms like "he" and "his", while blogs have more feminine terms like "she" and "her".
corrgram(dt.freq.merged, order=TRUE, lower.panel=panel.pie, upper.panel=panel.conf)
title(main=paste("Corrilaiton of word frequencies"))
Based on correlation of word frequencies, tweets and news articles are least alike, while blog vocabulary equally resembles tweets and news articles, with about a 95% correlation to each.
Analysis of Language
#Creat frequencies Bigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,2)
dt.freq.news<-cleanFreq(sample.news,2)
dt.freq.blogs<-cleanFreq(sample.blogs,2)
#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>15],
dt.freq.blogs[freq>15],
by="word",
all=TRUE),
dt.freq.news[freq>15],
by="word",
all=TRUE)
for (j in names(dt.freq.merged))
set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="BiGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
dt.freq.melted$parent,"\'",
sep='')]
dt.top=data.table(parent=NA,word="BiGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
biwordTree <- gvisTreeMap(dt.freq.merged,
idvar="word",
parentvar="parent",
"value",
options=list(height=700, width=450))
#Creat frequencies Trigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,3)
dt.freq.news<-cleanFreq(sample.news,3)
dt.freq.blogs<-cleanFreq(sample.blogs,3)
#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>3],
dt.freq.blogs[freq>3],
by="word",
all=TRUE),
dt.freq.news[freq>3],
by="word",
all=TRUE)
for (j in names(dt.freq.merged))
set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="TriGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
dt.freq.melted$parent,"\'",
sep='')]
dt.top=data.table(parent=NA,word="TriGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
triwordTree <- gvisTreeMap(dt.freq.merged,
idvar="word",
parentvar="parent",
"value",
options=list(height=700, width=450))
mergeTree <- gvisMerge(biwordTree,triwordTree,horizontal = TRUE)
plot(mergeTree)
left-clickto See source document breakdown, Right-clickto go back
Conclusion and future work
With the given analysis, it is now possible to build a next-word language model. Several approaches will be evaluated:
Traditional language model
Hidden Markov model
Katz's back-off model
Deep Learning Language Model
Recurrent Neural Networks
word embeddings
To measure model fit, training data will be segmented with a leave-one-out cross validation strategy. Since twitter and news article data are statistically least alike, either could be used as the validation set to measure model effectiveness on an independent data set.