Test file

Social Media Training Data Deep-Dive

Author: Jacob Govshteyn Date: Saturday, July 25, 2015

synopsis

Through exploratory data analysis a thorough understanding of the training data set will aid in the development of an accurate next-word perdition model. A deep dive of the of the three social media data sets should shed light on possible techniques needed to sample, mine, and parse data from each set.

Data Acquisition

### Import required libraries.
library(knitr)
opts_chunk$set(bootstrap.show.code=FALSE, bootstrap.thumbnail=FALSE, bootstrap.show.message=FALSE)
library(xtable)
library(corrgram)
library(data.table)
library(wordcloud)
library("RColorBrewer") 
library(RWeka)
library("data.table")
library(tm)
library(treemap)
library(googleVis)

Loading Dataset

The data set consist of a natural language text pool of three sources from HC Corpora

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt
#Create data directory to house  files  

linesExtract<-function(filePath,sampleSize=1){
    fd <- file(filePath, "rb")
    file.lines=readLines(fd,skipNul = TRUE,  warn = FALSE, encoding="UTF-8")
    total.lines=length(file.lines)
    if(sampleSize<1){
        file.sample <- file.lines[sample(1:total.lines, sampleSize*total.lines,replace=FALSE)]
        rm(file.lines) 
    }
    else{
        file.sample=file.lines     
    }    
    close(fd)
    return(file.sample) 
}
txtDir='C:/Users/jacob/Downloads/cap/en_US'
cygDir='cygdrive/c/Users/jacob/Downloads/cap/en_US'
set.seed(257)

sampleSize=.01

txtName.blogs='en_US.blogs.txt'
sample.blogs<-linesExtract(paste(txtDir,txtName.blogs,sep='/'),sampleSize)
summary.blogs<-system(paste("wc ",cygDir,txtName.blogs,sep='/'),intern = TRUE)
summary.blogs<-as.double(unlist(strsplit(summary.blogs,"\\s+"))[2:4])

txtName.tweets='en_US.twitter.txt'
sample.tweets<-linesExtract(paste(txtDir,txtName.tweets,sep='/'),sampleSize)
summary.tweets<-system(paste("wc ",cygDir,txtName.tweets,sep='/'),intern = TRUE)
summary.tweets<-as.double(unlist(strsplit(summary.tweets,"\\s+"))[2:4])

txtName.news='en_US.news.txt'
sample.news<-linesExtract(paste(txtDir,txtName.news,sep='/'),sampleSize)
summary.news<-system(paste("wc ",cygDir,txtName.news,sep='/'),intern = TRUE)
summary.news<-as.double(unlist(strsplit(summary.news,"\\s+"))[2:4])


if(!file.exists("./data")){
    dir.create("./data")
}
## download and profanity list
if(!file.exists("./data/badWorder.txt")){
    badWorderUrl <- "https://gist.githubusercontent.com/jamiew/1112488/raw/ce4448de3f7e2abcdba00ff019f7a41e2e99910c/google_twunter_lol"
    download.file(badWorderUrl, "./data/badWorder.txt")
}
badWorder=read.delim("./data/badWorder.txt",sep = ":",header = FALSE,stringsAsFactors = FALSE)
badWorder=badWorder[2:(nrow(badWorder)-1),1]

English Text Corpus Summary Statistics

dt.summary<-data.table(rbind(summary.blogs,summary.tweets,summary.news))
dt.summary<-cbind(c("Blogs","Tweets","Newspapers"),dt.summary)
names(dt.summary)<-c("Category","Lines","Words","Characters")
dt.total<-dt.summary[,lapply(.SD, sum, na.rm=TRUE),  .SDcols=c("Lines", "Words", "Characters") ]
dt.total[,Category:=c("Totals")]
setcolorder(dt.total, c("Category","Lines", "Words", "Characters"))
dt.summary<-rbind(dt.summary,dt.total)
dt.summary[,"Mean Words Per Line":= dt.summary$Words/dt.summary$Lines]
tab <- xtable(dt.summary,digits=c(0,0, 0, 0, 0, 2))
print(tab,
      include.rownames=FALSE, 
      type="html",
      format.args = list(big.mark = ",", decimal.mark = "."))
Category Lines Words Characters Mean Words Per Line
Blogs 899,288 37,334,114 210,160,014 41.52
Tweets 2,360,148 30,359,852 167,105,338 12.86
Newspapers 1,010,242 34,365,936 205,811,889 34.02
Totals 4,269,678 102,059,902 583,077,241 23.90

Exploratory analysis

Due to memory limitation, a 1% sample of each source will be used for analysis.

Tokenization

From the raw pool of text, each document is divided into a sequence of tokens, which roughly correspond to "words". In order to properly index each token, several normalization tasks are applied to the set of tokens.

Data source exploration

Obviously each entry in the twitter doc is a single tweet, but it's not clear how the blog and news entries are derived. It's unclear whether blog and news entries are user comments, random sentences, or whole entries. A crude, yet effective method to query the entries would be to simply Google some random entries from each set.

set.seed(257)
query.sample.news=sample(1:length(sample.news),1)
query.sample.blogs=sample(1:length(sample.blogs),1)
sample.blogs[query.sample.blogs]
sample.news[query.sample.news]

From examination, each blog and news entry is segmented by newlines(). This isn't a line in a traditional sense, from beginning to end of the page, but roughly equals a paragraph. Since tweets are limited to 144 characters, this helps explain the difference in mean words per Line between tweets and news/blog entries.

Analysis of Vocabulary

In order to build an accurate model, a well-represented slice of the population must be used in the compilation of training and test data. Analysis of the three documents should reveal localized grammar bias specific to any individual sources, and correlation between any two sources.

corpusClean <- function(file.Corpus) {
    #remove select punctuation 
    removePunctuation<- function(x) {
        x<- gsub("[^[:alnum:][:space:]\'-]", "", x)
        gsub("(\\w[\'-]\\w)|[[:punct:]]", "\\1", x)  
    }
    #fix apostrophe Gremlins.  
    fixGremlins <- function(x) {
        gsub("[\u0091\u0092]", "'", x) 
    }
    #repeat words
    removeRepeat <- function(x) {
        gsub("[[:space:]](\\w+)[[:space:]]\\1", " \\1", x) 
    }
    #url
    removeUrl <- function(x) {
        gsub("http\\w+|www\\w+", "", x) 
    }
    #html\xml
    removeMarkup <- function(x) {
        gsub("<\\w+>","",x)
    }
    #twitter meta data removal
    removeMeta <- function(x) {
        #@UserName
        x <- gsub("@\\w+", "", x)
        #hashtag
        x <- gsub("#\\w+", "", x)
        #retweets\via
        gsub("^(RT|via)|(RT :)", "", x)
    }
    removeNumeric <- function(x) {
        gsub("(\\w*[0-9]+\\w*)", "", x)
    }
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeNumeric))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeMeta))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeMarkup))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))
    file.Corpus <- tm_map(file.Corpus, content_transformer(fixGremlins))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeUrl))
    file.Corpus <- tm_map(file.Corpus, content_transformer(tolower))
    file.Corpus <- tm_map(file.Corpus, removeWords, badWorder)
    file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))

}
cleanFreq<-function(object.sample,n=1){
        # special token for apostrophe
        APS     <- "__ap__"                      
        getGrams <- function(x) NGramTokenizer(x, Weka_control(min=n,max=n))
        subtApo <- function(x) {
            gsub("'", APS, x)
        }
        resApo <- function(x){
            string <- gsub(APS, "'", x)
        }
        #Creat and clean corpus object from sample
        object.corpus <- corpusClean(VCorpus(VectorSource(object.sample)))
        object.corpus<- tm_map(object.corpus, content_transformer(subtApo))
        object.tdm <- TermDocumentMatrix(object.corpus,
                                       control = list(stopwords=FALSE, 
                                                      wordLengths = c(1,100),
                                                      tokenize=getGrams))
        m <- as.matrix(object.tdm)
        # get word counts in decreasing order
        word_freqs = sort(rowSums(m), decreasing=TRUE) 
        # create a data table with words and their frequencies
        dt.object= data.table(word=resApo(names(word_freqs)), freq=word_freqs)
        return(dt.object)
        
}
#Creat clean frequencies object from sample  
dt.freq.tweets<-cleanFreq(sample.tweets)
dt.freq.tweets[,freq:=dt.freq.tweets$freq/sum(dt.freq.tweets$freq)]
dt.freq.news<-cleanFreq(sample.news)
dt.freq.news[,freq:=dt.freq.news$freq/sum(dt.freq.news$freq)]
dt.freq.blogs<-cleanFreq(sample.blogs)
dt.freq.blogs[,freq:=dt.freq.blogs$freq/sum(dt.freq.blogs$freq)]

dt.freq.merged=merge(merge(dt.freq.tweets,dt.freq.blogs,by="word",all=TRUE),
                     dt.freq.news,by="word",all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
wordThresh=.005
setkey(dt.freq.merged,tweets,blogs,news)
op <- options(gvis.plot.tag = "chart")
Bar <- gvisBarChart(dt.freq.merged[tweets>wordThresh|blogs>wordThresh|news>wordThresh],
                    options=list(height=900, width=900,title='Top Word Frequency in Each Document')) 
plot(Bar)

From the word frequency graph it's clear that vocabulary differs between the text sources.

set.seed(456)
rterms <- as.matrix(dt.freq.merged[,c("tweets","blogs","news"),with=F])
rownames(rterms) <- dt.freq.merged$word
par(mfrow=c(1,2))
comparison.cloud(rterms,min.freq=300, 
                 colors=brewer.pal(8, "Dark2"), 
                 rot.per=.5,
                 title.size = 2,
                 scale=c(8,1))
title(main=paste("Difference in Word Frequencies"))
commonality.cloud(rterms,
                 colors=rev(brewer.pal(8, "Dark2")),
                 rot.per=.5,
                 title.size = 2,
                 scale=c(8,1))
title(main=paste("Common Word Frequencies"))

One obvious difference is the Narrative point-of-view.

  • Tweets tend to use first and second person perspective.
    • I
    • me
    • you
  • News articles are almost exclusively told from a third person perspective.
    • he
    • his
  • Blogs have a mixture of third and first person Narrative.
    • she
    • them
    • we

There's also a difference in gender specific terminology. News articles are more masculine oriented, with high frequency of terms like "he" and "his", while blogs have more feminine terms like "she" and "her".

corrgram(dt.freq.merged, order=TRUE, lower.panel=panel.pie, upper.panel=panel.conf)
title(main=paste("Corrilaiton of word frequencies"))

Based on correlation of word frequencies, tweets and news articles are least alike, while blog vocabulary equally resembles tweets and news articles, with about a 95% correlation to each.

Analysis of Language

#Creat frequencies Bigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,2)
dt.freq.news<-cleanFreq(sample.news,2)
dt.freq.blogs<-cleanFreq(sample.blogs,2)

#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>15],
                           dt.freq.blogs[freq>15],
                           by="word",
                           all=TRUE),
                     dt.freq.news[freq>15],
                     by="word",
                     all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="BiGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
                            dt.freq.melted$parent,"\'",
                            sep='')]
dt.top=data.table(parent=NA,word="BiGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
biwordTree <- gvisTreeMap(dt.freq.merged,
                          idvar="word", 
                          parentvar="parent",
                          "value",
                          options=list(height=700, width=450))
#Creat frequencies Trigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,3)
dt.freq.news<-cleanFreq(sample.news,3)
dt.freq.blogs<-cleanFreq(sample.blogs,3)

#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>3],
                           dt.freq.blogs[freq>3],
                           by="word",
                           all=TRUE),
                     dt.freq.news[freq>3],
                     by="word",
                     all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="TriGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
                            dt.freq.melted$parent,"\'",
                            sep='')]
dt.top=data.table(parent=NA,word="TriGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
triwordTree <- gvisTreeMap(dt.freq.merged, 
                           idvar="word", 
                           parentvar="parent",
                          "value",
                           options=list(height=700, width=450))

mergeTree <- gvisMerge(biwordTree,triwordTree,horizontal = TRUE)
plot(mergeTree)

left-click to See source document breakdown, Right-click to go back

Conclusion and future work

With the given analysis, it is now possible to build a next-word language model. Several approaches will be evaluated:

  1. Traditional language model
    • Hidden Markov model
    • Katz's back-off model
  2. Deep Learning Language Model
    • Recurrent Neural Networks
    • word embeddings
To measure model fit, training data will be segmented with a leave-one-out cross validation strategy. Since twitter and news article data are statistically least alike, either could be used as the validation set to measure model effectiveness on an independent data set.