I downloaded the raw data files and read them into R. The first file contains text from Twitter, the second contains text from blogs, and the third contains text from news reports. All the texts are in English.
Each line in a text file corresponds to a separate item and is read in as an element of a character vector. For example, the first two items from each file are:
twitter[1:2]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
blogs[1:2]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan âgodsâ."
## [2] "We love you Mr. Brown."
news[1:2]
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
The Twitter data contains 2360148 items, the blogs data 899288 items, and the news data 77259 items.
The Twitter data has many more items than the other sources, but since tweets are, in general, much shorter than other texts, many more items are needed to sample roughly as many words as the other sources provide with fewer, longer items. (The examples above are not representative of the length of most texts.) For the moment I restrict the exploration of the distribution of words to only about 5% of the items in the datasets, since the full datasets are very large and take a long time to process. The items considered are sampled randomly from the full datasets.
The next step was to remove profanities from the texts. I did this by reading into R a blacklist of words I found online and then using the removeWords function from the tm package. I also considered removing “stopwords”, but decided against it, because these words should be useful when predicting the next word in a sentence. Following this, I transformed all upper-case letters to lower case and removed all characters apart from spaces, alphabetical characters and apostrophes. I also stemmed the words and removed unnecessary whitespace.
TermDocumentMatrix from the tm package, together with a tokenizer from the RWeka package, was used to create counts of ngrams of different sizes: single words, two-word phrases and three-word phrases.
The term document matrices count the number of instances of each term in each of the data items. The number of terms (ngrams of size n=1,2,3) found in each dataset is equal to the number of rows of the matrix. The following table summarizes the number of ngrams found:
## data_source one_grams two_grams three_grams
## 1 twitter 54181 479727 930049
## 2 blogs 8324 47249 68482
## 3 news 4200 14742 17488
The numbers of ngrams are very large, and the term document matrices are very sparse, because many words, or word combinations, occur only once in the whole dataset. If a word occurs only once in 118000 separate texts, it is either a very rarely used word, a misspelled word, or a non-word (a random combination of letters, or gibberish, that appears in the text). Since the matrices created from the Twitter source are extremely large and take up a lot of memory, I decided to remove terms that appear only once, to make them easier to manipulate. The other matrices are of manageable sizes.
The numbers of ngrams found are now:
## data_source one_grams two_grams three_grams
## 1 twitter 19121 109124 92512
## 2 blogs 8324 47249 68482
## 3 news 4200 14742 17488
Once the term document matrices have been created, the row sums give the frequency with which each term occurs in the texts that were inspected. Summary statistics for the term frequencies in each matrix:
## tdm1t tdm1b tdm1n tdm2t tdm2b tdm2n tdm3t tdm3b tdm3n
## Min 2.00 1.000 1.000 2.000 1.000 1.000 2.00 1.000 1.000
## 1st Qu. 2.00 1.000 1.000 2.000 1.000 1.000 2.00 1.000 1.000
## Median 5.00 1.000 1.000 3.000 1.000 1.000 2.00 1.000 1.000
## Mean 56.19 7.174 3.632 9.149 1.596 1.253 4.47 1.073 1.024
## 3rd Qu. 15.00 4.000 3.000 6.000 1.000 1.000 4.00 1.000 1.000
## Max 46781.00 3859.000 1126.000 6408.000 375.000 129.000 1213.00 27.000 8.000
Barplots displaying the frequencies of the 50 most frequent terms from each text source:
I merged the term frequency results from the three sources and plotted a wordcloud showing the 100 overall most frequent 1grams:
Barplots of the 50 most frequent two-word phrases from each source, and a wordcloud of the 100 most frequent two-word phrases from all sources combined:
Similarly for three-word phrases:
My idea is to look at the two words preceding the word that is to be predicted. If there is a 3gram that begins with these two words, my prediction will be the third word of that 3gram. If there are several 3grams that begin with these two words, I’ll choose the one with the highest observed frequency.
If there is no such 3gram, I’ll “back off” to only the last word and look for a 2gram that begins with this word, again choosing the one with the highest frequency.
If there is no such observed 2gram, I’ll “back off” to the 1gram list and look for the word that has the highest correlation with the last word.
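As a rough illustration of this backoff idea, here is a minimal sketch (not part of the analysis code below). It assumes hypothetical lookup tables: ngram3 with columns word1, word2, word3 and Freq, ngram2 with columns word1, word2 and Freq (both obtainable by splitting the ngram frequency tables), and top1, a character vector of 1grams sorted by decreasing frequency. The correlation-based fallback is simplified here to returning the most frequent 1gram.
#illustrative sketch only; ngram3, ngram2 and top1 are hypothetical lookup tables
predictNext <- function(w1, w2, ngram3, ngram2, top1){
  #3grams starting with the two previous words: predict the most frequent third word
  hits3 <- ngram3[ngram3$word1 == w1 & ngram3$word2 == w2, ]
  if (nrow(hits3) > 0) return(as.character(hits3$word3[which.max(hits3$Freq)]))
  #back off to 2grams starting with the last word only
  hits2 <- ngram2[ngram2$word1 == w2, ]
  if (nrow(hits2) > 0) return(as.character(hits2$word2[which.max(hits2$Freq)]))
  #final fallback: the most frequent 1gram
  as.character(top1[1])
}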
library(tm)
library(RWeka)
library(wordcloud)
library(SnowballC)
library(ggplot2)
#read text files into vectors
twitter<-readLines("en_US.twitter.txt")
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
#to find number of lines in text file go to cmd line and use wc -l filename.txt
#number of lines : en_US.twitter.txt - 2360148
# en_US.blog.txt - 40835
# en_US.news.txt - 11384
set.seed(11568)
#random samples of approx. 5% of data
t_samp <- twitter[sample(length(twitter),118000)]
b_samp <- blogs[sample(length(blogs),2000)]
n_samp <- news[sample(length(news),569)]
#create a source for texts
sourcet <- VectorSource(t_samp)
sourceb <- VectorSource(b_samp)
sourcen <- VectorSource(n_samp)
#create corpus of texts
corpust <- Corpus(sourcet)
corpusb <- Corpus(sourceb)
corpusn <- Corpus(sourcen)
#remove profanities
#use blacklist found on internet
bad<-readLines("https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
corpust <- tm_map(corpust, removeWords, bad)
corpusb <- tm_map(corpusb, removeWords, bad)
corpusn <- tm_map(corpusn, removeWords, bad)
#clean up text
corpust <- tm_map(corpust, content_transformer(tolower))
corpust <- tm_map(corpust, content_transformer(function(x)gsub("[^ a-zA-Z\']","",x) ))
corpust <- tm_map(corpust, stemDocument)
corpust <- tm_map(corpust,stripWhitespace)
corpusb <- tm_map(corpusb, content_transformer(tolower))
corpusb <- tm_map(corpusb, content_transformer(function(x)gsub("[^ a-zA-Z\']","",x)))
corpusb <- tm_map(corpusb, stemDocument)
corpusb <- tm_map(corpusb,stripWhitespace)
corpusn <- tm_map(corpusn, content_transformer(tolower))
corpusn <- tm_map(corpusn, content_transformer(function(x)gsub("[^ a-zA-Z\']","",x) ))
corpusn <- tm_map(corpusn, stemDocument)
corpusn <- tm_map(corpusn,stripWhitespace)
corpust <- tm_map(corpust,PlainTextDocument)
corpusb <- tm_map(corpusb,PlainTextDocument)
corpusn <- tm_map(corpusn,PlainTextDocument)
#create term document matrices
oneTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm1t <- TermDocumentMatrix(corpust, control = list(tokenize = oneTokenizer))
tdm1b <- TermDocumentMatrix(corpusb, control = list(tokenize = oneTokenizer))
tdm1n <- TermDocumentMatrix(corpusn, control = list(tokenize = oneTokenizer))
twoTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2t <- TermDocumentMatrix(corpust, control = list(tokenize = twoTokenizer))
tdm2b <- TermDocumentMatrix(corpusb, control = list(tokenize = twoTokenizer))
tdm2n <- TermDocumentMatrix(corpusn, control = list(tokenize = twoTokenizer))
threeTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3t <- TermDocumentMatrix(corpust, control = list(tokenize = threeTokenizer))
tdm3b <- TermDocumentMatrix(corpusb, control = list(tokenize = threeTokenizer))
tdm3n <- TermDocumentMatrix(corpusn, control = list(tokenize = threeTokenizer))
#number of rows in each matrix = number of terms
mats <- c("tdm1t","tdm1b","tdm1n","tdm2t","tdm2b","tdm2n","tdm3t","tdm3b","tdm3n")
freqs <- sapply(mats,function(x)dim(get(x))[1])
data.frame(data_source=c("twitter","blogs","news"),one_grams=freqs[1:3],two_grams=freqs[4:6],three_grams=freqs[7:9],row.names=NULL)
#remove terms that appear only once in the twitter texts
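#(removeSparseTerms keeps only terms whose sparsity is below the given threshold;
# with 118000 sampled tweets, 1-1/117999 keeps terms that occur in at least two of them)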
tdm1t <- removeSparseTerms(tdm1t,1-1/117999)
tdm2t <- removeSparseTerms(tdm2t,1-1/117999)
tdm3t <- removeSparseTerms(tdm3t,1-1/117999)
#number of rows now in the matrices
freqs <- sapply(mats,function(x)dim(get(x))[1])
data.frame(data_source=c("twitter","blogs","news"),one_grams=freqs[1:3],two_grams=freqs[4:6],three_grams=freqs[7:9],row.names=NULL)
#The matrices derived from the twitter source are still too large to use the rowSums
#function, and must be inspected row by row using the following function.
#This takes quite some time!
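#(slam::row_sums() works on the sparse matrix directly and may be a faster alternative)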
getfreq <- function(tdm){
  freqs <- NULL
  i <- 1
  r <- nrow(tdm)
  while (i <= r){
    temp <- as.vector(tdm[i,])
    counts <- sum(temp)
    freqs <- c(freqs,counts)
    i <- i+1
  }
  return(freqs)
}
termFreq <- list("m1t"=NULL,"m1b"=NULL,"m1n"=NULL,"m2t"=NULL,"m2b"=NULL,"m2n"=NULL,"m3t"=NULL,"m3b"=NULL,"m3n"=NULL)
for (j in c(1,4,7)){
  tdm <- get(paste0("td",names(termFreq)[j]))
  termFreq[[j]] <- data.frame(Terms=tdm$dimnames$Terms,Freq=getfreq(tdm))
}
for (j in c(2,3,5,6,8,9)){
  tdm <- get(paste0("td",names(termFreq)[j]))
  termFreq[[j]] <- data.frame(Terms=tdm$dimnames$Terms,Freq=rowSums(as.matrix(tdm)))
}
# display summary for each matrix
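#summary() on these data frames returns strings such as "Mean   :56.19"; the code below
#splits on ":" to extract the numbers, then drops the padding row added for the Terms column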
summ <- sapply(termFreq,function(x) summary(x)[,2])
summ<-matrix(sapply(summ,function(x)as.numeric(strsplit(x,":")[[1]][2])),7,9)
summ <- summ[1:6,]
rownames(summ)<-c("Min","1st Qu.","Median","Mean","3rd Qu.", "Max")
colnames(summ)<-mats
#function to create barplots of the most frequent ngrams
bplot <- function(dat,i,j){
  n <- c("1grams","2grams","3grams")
  t <- c("Twitter","Blogs", "News","all")
  p <- ggplot(dat, aes(Terms, Freq))
  p <- p + geom_bar(stat="identity")
  p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
  p <- p + ggtitle(paste(n[i],"from",t[j],"Texts"))
}
#reorder frequency lists by decreasing frequencies
ordered <- lapply(termFreq,function(x)x[order(-x[,2]),])
for (j in 1:3){
  p <- bplot(ordered[[j]][1:50,],1,j)
  print(p)
}
#merge 1grams from all three sources and create wordcloud of 100 most common words
merger <- function(x,y,z){
  merged <- merge(x,y,all=TRUE,by="Terms")
  merged <- merge(merged,z,all=TRUE,by="Terms")
  merged[,2:4][is.na(merged[,2:4])] <- 0
  merged$Total <- merged[,2]+merged[,3]+merged[,4]
  totFreq <- data.frame(Terms=merged$Terms,Freq=merged$Total)
  return(totFreq[order(-totFreq[,2]),])
}
tFreq <- merger(ordered[[1]],ordered[[2]],ordered[[3]])
wordcloud(tFreq$Terms[1:100], tFreq$Freq[1:100], max.words = 100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
#2grams
for (j in 4:6){
  p <- bplot(ordered[[j]][1:50,],2,j-3)
  print(p)
}
Freq2 <- merger(ordered[[4]],ordered[[5]],ordered[[6]])
wordcloud(Freq2$Terms[1:100], Freq2$Freq[1:100], max.words = 100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
#3grams
for (j in 7:9){
  p <- bplot(ordered[[j]][1:50,],3,j-6)
  print(p)
}
Freq3 <- merger(ordered[[7]],ordered[[8]],ordered[[9]])
wordcloud(Freq3$Terms[1:100], Freq3$Freq[1:100], max.words = 100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))