Social Media Training Data Deep-Dive

Author: Jacob Govshteyn Date: Saturday, July 25, 2015

synopsis

Through exploratory data analysis a thorough understanding of the training data set will aid in the development of an accurate next-word perdition model. A deep dive of the of the three social media data sets should shed light on possible techniques needed to sample, mine, and parse data from each set.

Data Acquisition

### Import required libraries.
library(knitr)
opts_chunk$set(bootstrap.show.code=FALSE, bootstrap.thumbnail=FALSE, bootstrap.show.message=FALSE)
library(xtable)
library(corrgram)
library(data.table)
library(wordcloud)
library("RColorBrewer") 
library(RWeka)
library("data.table")
library(tm)
library(treemap)
library(googleVis)

Loading Dataset

The data set consist of a natural language text pool of three sources from HC Corpora

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

#Create data directory to house  files  

linesExtract<-function(filePath,sampleSize=1){
    fd <- file(filePath, "rb")
    file.lines=readLines(fd,skipNul = TRUE,  warn = FALSE, encoding="UTF-8")
    total.lines=length(file.lines)
    if(sampleSize<1){
        file.sample <- file.lines[sample(1:total.lines, sampleSize*total.lines,replace=FALSE)]
        rm(file.lines) 
    }
    else{
        file.sample=file.lines     
    }    
    close(fd)
    return(file.sample) 
}
txtDir='C:/Users/jacob/Downloads/cap/en_US'
cygDir='cygdrive/c/Users/jacob/Downloads/cap/en_US'
set.seed(257)

sampleSize=.01

txtName.blogs='en_US.blogs.txt'
sample.blogs<-linesExtract(paste(txtDir,txtName.blogs,sep='/'),sampleSize)
summary.blogs<-system(paste("wc ",cygDir,txtName.blogs,sep='/'),intern = TRUE)
summary.blogs<-as.double(unlist(strsplit(summary.blogs,"\\s+"))[2:4])

txtName.tweets='en_US.twitter.txt'
sample.tweets<-linesExtract(paste(txtDir,txtName.tweets,sep='/'),sampleSize)
summary.tweets<-system(paste("wc ",cygDir,txtName.tweets,sep='/'),intern = TRUE)
summary.tweets<-as.double(unlist(strsplit(summary.tweets,"\\s+"))[2:4])

txtName.news='en_US.news.txt'
sample.news<-linesExtract(paste(txtDir,txtName.news,sep='/'),sampleSize)
summary.news<-system(paste("wc ",cygDir,txtName.news,sep='/'),intern = TRUE)
summary.news<-as.double(unlist(strsplit(summary.news,"\\s+"))[2:4])


if(!file.exists("./data")){
    dir.create("./data")
}
## download and profanity list
if(!file.exists("./data/badWorder.txt")){
    badWorderUrl <- "https://gist.githubusercontent.com/jamiew/1112488/raw/ce4448de3f7e2abcdba00ff019f7a41e2e99910c/google_twunter_lol"
    download.file(badWorderUrl, "./data/badWorder.txt")
}
badWorder=read.delim("./data/badWorder.txt",sep = ":",header = FALSE,stringsAsFactors = FALSE)
badWorder=badWorder[2:(nrow(badWorder)-1),1]

English Text Corpus Summary Statistics

dt.summary<-data.table(rbind(summary.blogs,summary.tweets,summary.news))
dt.summary<-cbind(c("Blogs","Tweets","Newspapers"),dt.summary)
names(dt.summary)<-c("Category","Lines","Words","Characters")
dt.total<-dt.summary[,lapply(.SD, sum, na.rm=TRUE),  .SDcols=c("Lines", "Words", "Characters") ]
dt.total[,Category:=c("Totals")]
setcolorder(dt.total, c("Category","Lines", "Words", "Characters"))
dt.summary<-rbind(dt.summary,dt.total)
dt.summary[,"Mean Words Per Line":= dt.summary$Words/dt.summary$Lines]
tab <- xtable(dt.summary,digits=c(0,0, 0, 0, 0, 2))
print(tab,
      include.rownames=FALSE, 
      type="html",
      format.args = list(big.mark = ",", decimal.mark = "."))

Category	Lines	Words	Characters	Mean Words Per Line
Blogs	899,288	37,334,114	210,160,014	41.52
Tweets	2,360,148	30,359,852	167,105,338	12.86
Newspapers	1,010,242	34,365,936	205,811,889	34.02
Totals	4,269,678	102,059,902	583,077,241	23.90

Exploratory analysis

Due to memory limitation, a 1% sample of each source will be used for analysis.

Tokenization

From the raw pool of text, each document is divided into a sequence of tokens, which roughly correspond to "words". In order to properly index each token, several normalization tasks are applied to the set of tokens.

Remove select punctuation
Fix apostrophe Gremlins.
Remove url
Remove html and xml
Remove profanity words (List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)
Remove twitter meta data
- @UserName
- hashtag
- retweets and via
Convert all text to lowercase

Data source exploration

Obviously each entry in the twitter doc is a single tweet, but it's not clear how the blog and news entries are derived. It's unclear whether blog and news entries are user comments, random sentences, or whole entries. A crude, yet effective method to query the entries would be to simply Google some random entries from each set.

set.seed(257)
query.sample.news=sample(1:length(sample.news),1)
query.sample.blogs=sample(1:length(sample.blogs),1)
sample.blogs[query.sample.blogs]
sample.news[query.sample.news]

From examination, each blog and news entry is segmented by newlines(). This isn't a line in a traditional sense, from beginning to end of the page, but roughly equals a paragraph. Since tweets are limited to 144 characters, this helps explain the difference in mean words per Line between tweets and news/blog entries.

Analysis of Vocabulary

In order to build an accurate model, a well-represented slice of the population must be used in the compilation of training and test data. Analysis of the three documents should reveal localized grammar bias specific to any individual sources, and correlation between any two sources.

corpusClean <- function(file.Corpus) {
    #remove select punctuation 
    removePunctuation<- function(x) {
        x<- gsub("[^[:alnum:][:space:]\'-]", "", x)
        gsub("(\\w[\'-]\\w)|[[:punct:]]", "\\1", x)  
    }
    #fix apostrophe Gremlins.  
    fixGremlins <- function(x) {
        gsub("[\u0091\u0092]", "'", x) 
    }
    #repeat words
    removeRepeat <- function(x) {
        gsub("[[:space:]](\\w+)[[:space:]]\\1", " \\1", x) 
    }
    #url
    removeUrl <- function(x) {
        gsub("http\\w+|www\\w+", "", x) 
    }
    #html\xml
    removeMarkup <- function(x) {
        gsub("<\\w+>","",x)
    }
    #twitter meta data removal
    removeMeta <- function(x) {
        #@UserName
        x <- gsub("@\\w+", "", x)
        #hashtag
        x <- gsub("#\\w+", "", x)
        #retweets\via
        gsub("^(RT|via)|(RT :)", "", x)
    }
    removeNumeric <- function(x) {
        gsub("(\\w*[0-9]+\\w*)", "", x)
    }
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeNumeric))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeMeta))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeMarkup))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))
    file.Corpus <- tm_map(file.Corpus, content_transformer(fixGremlins))
    file.Corpus <- tm_map(file.Corpus, content_transformer(removeUrl))
    file.Corpus <- tm_map(file.Corpus, content_transformer(tolower))
    file.Corpus <- tm_map(file.Corpus, removeWords, badWorder)
    file.Corpus <- tm_map(file.Corpus, content_transformer(removePunctuation))

}
cleanFreq<-function(object.sample,n=1){
        # special token for apostrophe
        APS     <- "__ap__"                      
        getGrams <- function(x) NGramTokenizer(x, Weka_control(min=n,max=n))
        subtApo <- function(x) {
            gsub("'", APS, x)
        }
        resApo <- function(x){
            string <- gsub(APS, "'", x)
        }
        #Creat and clean corpus object from sample
        object.corpus <- corpusClean(VCorpus(VectorSource(object.sample)))
        object.corpus<- tm_map(object.corpus, content_transformer(subtApo))
        object.tdm <- TermDocumentMatrix(object.corpus,
                                       control = list(stopwords=FALSE, 
                                                      wordLengths = c(1,100),
                                                      tokenize=getGrams))
        m <- as.matrix(object.tdm)
        # get word counts in decreasing order
        word_freqs = sort(rowSums(m), decreasing=TRUE) 
        # create a data table with words and their frequencies
        dt.object= data.table(word=resApo(names(word_freqs)), freq=word_freqs)
        return(dt.object)
        
}
#Creat clean frequencies object from sample  
dt.freq.tweets<-cleanFreq(sample.tweets)
dt.freq.tweets[,freq:=dt.freq.tweets$freq/sum(dt.freq.tweets$freq)]
dt.freq.news<-cleanFreq(sample.news)
dt.freq.news[,freq:=dt.freq.news$freq/sum(dt.freq.news$freq)]
dt.freq.blogs<-cleanFreq(sample.blogs)
dt.freq.blogs[,freq:=dt.freq.blogs$freq/sum(dt.freq.blogs$freq)]

dt.freq.merged=merge(merge(dt.freq.tweets,dt.freq.blogs,by="word",all=TRUE),
                     dt.freq.news,by="word",all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
wordThresh=.005

setkey(dt.freq.merged,tweets,blogs,news)
op <- options(gvis.plot.tag = "chart")
Bar <- gvisBarChart(dt.freq.merged[tweets>wordThresh|blogs>wordThresh|news>wordThresh],
                    options=list(height=900, width=900,title='Top Word Frequency in Each Document')) 
plot(Bar)

From the word frequency graph it's clear that vocabulary differs between the text sources.

set.seed(456)
rterms <- as.matrix(dt.freq.merged[,c("tweets","blogs","news"),with=F])
rownames(rterms) <- dt.freq.merged$word
par(mfrow=c(1,2))
comparison.cloud(rterms,min.freq=300, 
                 colors=brewer.pal(8, "Dark2"), 
                 rot.per=.5,
                 title.size = 2,
                 scale=c(8,1))
title(main=paste("Difference in Word Frequencies"))
commonality.cloud(rterms,
                 colors=rev(brewer.pal(8, "Dark2")),
                 rot.per=.5,
                 title.size = 2,
                 scale=c(8,1))
title(main=paste("Common Word Frequencies"))

One obvious difference is the Narrative point-of-view.

Tweets tend to use first and second person perspective.
- I
- me
- you
News articles are almost exclusively told from a third person perspective.
- he
- his
Blogs have a mixture of third and first person Narrative.
- she
- them
- we

There's also a difference in gender specific terminology. News articles are more masculine oriented, with high frequency of terms like "he" and "his", while blogs have more feminine terms like "she" and "her".

corrgram(dt.freq.merged, order=TRUE, lower.panel=panel.pie, upper.panel=panel.conf)
title(main=paste("Corrilaiton of word frequencies"))

Based on correlation of word frequencies, tweets and news articles are least alike, while blog vocabulary equally resembles tweets and news articles, with about a 95% correlation to each.

Analysis of Language

#Creat frequencies Bigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,2)
dt.freq.news<-cleanFreq(sample.news,2)
dt.freq.blogs<-cleanFreq(sample.blogs,2)

#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>15],
                           dt.freq.blogs[freq>15],
                           by="word",
                           all=TRUE),
                     dt.freq.news[freq>15],
                     by="word",
                     all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="BiGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
                            dt.freq.melted$parent,"\'",
                            sep='')]
dt.top=data.table(parent=NA,word="BiGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
biwordTree <- gvisTreeMap(dt.freq.merged,
                          idvar="word", 
                          parentvar="parent",
                          "value",
                          options=list(height=700, width=450))
#Creat frequencies Trigram data tables
dt.freq.tweets<-cleanFreq(sample.tweets,3)
dt.freq.news<-cleanFreq(sample.news,3)
dt.freq.blogs<-cleanFreq(sample.blogs,3)

#merge each source text
dt.freq.merged=merge(merge(dt.freq.tweets[freq>3],
                           dt.freq.blogs[freq>3],
                           by="word",
                           all=TRUE),
                     dt.freq.news[freq>3],
                     by="word",
                     all=TRUE)
for (j in names(dt.freq.merged))
    set(dt.freq.merged,which(is.na(dt.freq.merged[[j]])),j,0)
names(dt.freq.merged)<-c("word","tweets","blogs","news")
#calculate totals of frequency
dt.total=dt.freq.merged[, .SD$tweets+.SD$blogs+.SD$news, by=word ]
dt.total$parent="TriGram"
names(dt.total)<-c("word","value","parent")
dt.freq.melted<-melt(dt.freq.merged)
names( dt.freq.melted)<-c("parent","word","value")
#break down each frequency by source doc
dt.freq.melted[,word:=paste(dt.freq.melted$word,"-\'",
                            dt.freq.melted$parent,"\'",
                            sep='')]
dt.top=data.table(parent=NA,word="TriGram",value=sum(dt.total$value))
dt.freq.merged<-rbind(dt.freq.melted,dt.total,dt.top)
dt.freq.merged[,per:=dt.freq.merged$value]
triwordTree <- gvisTreeMap(dt.freq.merged, 
                           idvar="word", 
                           parentvar="parent",
                          "value",
                           options=list(height=700, width=450))

mergeTree <- gvisMerge(biwordTree,triwordTree,horizontal = TRUE)
plot(mergeTree)

left-click to See source document breakdown, Right-click to go back

Conclusion and future work

With the given analysis, it is now possible to build a next-word language model. Several approaches will be evaluated:

Traditional language model
- Hidden Markov model
- Katz's back-off model
Deep Learning Language Model
- Recurrent Neural Networks
- word embeddings

To measure model fit, training data will be segmented with a leave-one-out cross validation strategy. Since twitter and news article data are statistically least alike, either could be used as the validation set to measure model effectiveness on an independent data set.