Introduction

The first step in building a predictive model for text is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text; this milestone report covers that step. The goal of this task is to understand the basic relationships observed in the data and to prepare for building the first linguistic models:

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.

  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

All R code for this report is shown in the appendix. To speed things up, I used parallel processing.

Basic Exploratory Data Analysis

The table below shows some summary statistics for the training files (Coursera-SwiftKey.zip). Initially, I had two problems:

  1. I got a warning when reading the news file, and the character count did not match the file size. This was solved by reading the file through a binary connection (see the sketch after the table).

  2. There were embedded nuls, which showed up as empty lines I did not want to sample later on. This was solved by setting skipNul = TRUE in readLines().

File name           File size (MB)      Lines   Non-empty lines        Chars   Chars (no whitespace)
en_US.blogs.txt             200.42     899288            899288    206824382               170389539
en_US.news.txt              196.28    1010242           1010242    203223154               169860866
en_US.twitter.txt           159.36    2360148           2360148    162096241               134082806
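
The two workarounds above, condensed from the appendix code, look like this:

## blogs and twitter: skip embedded nuls so that no broken lines are sampled later
blogs_all   <- readLines('en_US.blogs.txt',   encoding = 'UTF-8', skipNul = TRUE)
twitter_all <- readLines('en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)

## news: read through a binary connection so that the whole file is read
con      <- file('en_US.news.txt', open = 'rb')
news_all <- readLines(con, encoding = 'UTF-8')
close(con)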

Sample Size

The files are too large for my computer, so I work with samples. What is the optimal sample size? It depends. In Fundamentals of Predictive Text Mining (Weiss, Indurkhya and Zhang, 2010) you can read on page 42: “Our main focus is on learning and feedback from an evaluation of results without anticipating the likely success of learning from any size sample.”

In The N in Text Analytics: Text Mining with Different Sample Sizes, Tom H. C. Anderson, the founder of Anderson Analytics, says: “I wouldn’t say that there’s a minimum size per se, though I would say that the ROI of text analytics increases exponentially with the size of the data … Many of our customers do find text analytics useful for smaller ad-hoc survey data with sample sizes around n = ~1,000 as well.”

I did some manual probing to check the consistency across random samples: first I took 10 samples of 1,000 lines each from each of the three files, cleaned the data, and selected the 100 most frequent words from each sample. Then I extracted the words common to all ten samples. I repeated this process with 2,000, 4,000, 8,000, 16,000, and 32,000 lines. The percentages of shared words across the samples are given in the table below; a sketch of the probing procedure follows.

Overlap (%) of the 100 most frequent words across 10 random samples, by sample size:

            1k    2k    4k    8k    16k    32k
Blogs       54    65    77    81     85     87
News        46    55    67    77     77     89
Twitter     43    52    64    74     81     84

On average, the overlap still increases by more than five percentage points when the sample size grows from 16,000 to 32,000 lines, so I will be using samples of 32,000 lines.
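
The probing itself is not part of the appendix code; a sketch of what I did for one file and one sample size could look like the code below. Here blogs_all is the full blogs file and the 1,000-line sample size is just one of the sizes tested; the inline cleaning stands in for the tm-based cleaning described later.

library(tm)

## top 100 words of one random sample of 1000 lines
top100 <- function(lines) {
    x <- removeWords(stripWhitespace(removePunctuation(removeNumbers(tolower(lines)))),
                     stopwords("english"))
    words <- unlist(strsplit(x, "\\s+"))
    words <- words[nzchar(words)]                        # drop empty tokens
    names(head(sort(table(words), decreasing = TRUE), 100))
}

## words shared by the top 100 of all ten samples, reported as a percentage of 100
set.seed(23)
tops <- replicate(10, top100(sample(blogs_all, 1000)), simplify = FALSE)
length(Reduce(intersect, tops))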

Data Cleaning

Overall, stemming achieves a large reduction in dictionary size and is modestly beneficial for predictive performance when using a smaller dictionary (page 23, Fundamentals of Predictive Text Mining, Weiss, Indurkhya and Zhang, 2010). For now, I begin without stemming. Characters are converted to lowercase, and punctuation, numbers, extra white space, and English stop words are removed, as sketched below.
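
The cleaning pipeline, condensed from the appendix (the samples/ folder holds the three sampled text files):

library(tm)

docs <- Corpus(DirSource('samples'))                     # one document per sampled file
docs <- tm_map(docs, content_transformer(tolower))       # lowercase
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeNumbers)                      # drop numbers
docs <- tm_map(docs, stripWhitespace)                    # collapse white space
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop English stop words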

Tokenization

Here I construct a document-term matrix from the corpus: one row per sample (blogs, news, Twitter) and one column per term.
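
A minimal sketch, with docs being the cleaned corpus from the previous step; the last two lines are only a quick sanity check:

dtm <- DocumentTermMatrix(docs)
dim(dtm)                                  # 3 documents x number of distinct terms
head(findFreqTerms(dtm, lowfreq = 1000))  # a few terms occurring at least 1000 times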

Profanity

In the Habitat Chronicles you can read about the 14-year-old boy who, within minutes, used harmless words to construct the sentence ‘I want to stick my long-necked Giraffe up your fluffy white bunny.’ So, profanity filtering is not easy. Nevertheless, I will filter the documents for words from the following sources:

  1. List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

  2. googles-official-list-of-bad-words

Profane words as a percentage of all words:

                       Blogs    News    Twitter
Relative frequency      0.21    0.08       0.86

One sees that there are about ten times as many profane words in Twitter as in news, and about two and a half times as many in blogs as in news. The most frequent profane words are hell, shit, sex, damn, fuck, ass, bitch, fucking, balls, and santorum. Profane words are removed and will not be predicted later on; a condensed sketch of the filtering follows.
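
The filtering step, condensed from the appendix; the two word lists are downloaded from the URLs used there:

bad_words <- unique(c(
    readLines('http://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en'),
    readLines('http://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol')))

bw_dtm <- dtm[, intersect(colnames(dtm), bad_words)]         # columns that are profane terms
100 * rowSums(as.matrix(bw_dtm)) / rowSums(as.matrix(dtm))   # percent profane per document
dtm <- dtm[, -which(colnames(dtm) %in% bad_words)]           # drop profane terms from the DTM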

Zipf’s law

Zipf’s law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. The coefficients below come from the log-log fit reported by tm’s Zipf_plot (log frequency against log rank); the slope of about -1.28 is reasonably close to the -1 that Zipf’s law predicts.

## (Intercept)           x 
##   14.366315   -1.279309
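
The same fit can be reproduced by hand from the document-term matrix; this is a sketch of the idea rather than Zipf_plot’s exact code:

y <- log(sort(slam::col_sums(dtm), decreasing = TRUE))  # log term frequency over all samples
x <- log(seq_along(y))                                  # log rank
coef(lm(y ~ x))                                         # slope near -1 is Zipf-like behaviour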

Heaps’ law

Heaps’ law states that the vocabulary size grows polynomially with the text size: V(n) ≈ K * n^beta. The coefficients below come from the log-log fit reported by tm’s Heaps_plot; the slope of about 0.61 is the estimated exponent beta.

## (Intercept)           x 
##   2.8558139   0.6117761
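
A hand-rolled version of the fit over the three samples; this is a sketch of the idea and not necessarily tm’s exact internals, so the numbers may differ slightly from the output above:

m <- as.matrix(dtm)
text_size  <- cumsum(rowSums(m))                          # cumulative number of tokens
vocab_size <- sapply(seq_len(nrow(m)), function(i)
    sum(colSums(m[seq_len(i), , drop = FALSE]) > 0))      # cumulative number of distinct terms
coef(lm(log(vocab_size) ~ log(text_size)))                # slope is the Heaps exponent beta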

Word Clouds

Frequencies of Words

Frequencies of Word Pairs

Frequencies of 3-grams

Frequencies of 4-grams

Next Steps

Appendix

R Code

## Load libraries
library(stringi)
library(knitr)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(slam)
library(ggplot2)        
library(reshape2)       ## flexibly reshape data
library(doParallel)     ## parallel processing

## configure parallel processing
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)

## Read documents
folder <- 'C:/Users/olalie/Documents/DataScience/10_Capstone/Milestone'
setwd(folder)
file_names <- c('en_US.blogs.txt','en_US.news.txt','en_US.twitter.txt')

if (!file.exists('doc_summary')) {
 
   # Read files
    blogs_all   <- readLines('en_US.blogs.txt', encoding = 'UTF-8', skipNul = TRUE)
  
    ## Read news as a binary file to get the whole file
    tmp <- file('en_US.news.txt', open='rb')
    news_all <- readLines(tmp, encoding='UTF-8')
    close(tmp); rm(tmp)
  
    twitter_all <- readLines('en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)

  # Create data frame for file statistics
  doc_summary <- data.frame(matrix(ncol=5,nrow=length(file_names)))
  rownames(doc_summary) <- file_names
  colnames(doc_summary) <- c('File Size MB', 'Lines', 'LinesNEmpty', 'Chars', 'CharsNWhite')

  # File sizes
  for (i in seq(file_names)) doc_summary[i,'File Size MB'] <- file.info(file_names[i])$size/1024^2
  
  # General statistics for a character vector 
  doc_summary[1,2:5] <- stri_stats_general(blogs_all) 
  doc_summary[2,2:5] <- stri_stats_general(news_all) 
  doc_summary[3,2:5] <- stri_stats_general(twitter_all)

  write.table(doc_summary,'doc_summary')
} else {
  doc_summary <- read.table('doc_summary',header=T)
}

kable(doc_summary, digits=2)

## plot percentage of similar words accross samples for different sample sizes.
df <- data.frame(c(54,46,43),c(65,55,52),c(77,67,64),c(81,77,74),c(85,77,81),c(87,89,84),c('blogs','news','twitter'))
colnames(df)<-c('1k','2k','4k','8k','16k','32k','Type')
df <- melt(df, id.vars = 'Type', variable.name = "Size", value.name = "Percentage")
ggplot(df, aes(x = Size, y = Percentage, fill = Type)) +
  xlab('Sample Size') + ylab('Percentage') + 
  ggtitle('Similar words among 100 most frequent words in 10 samples') +
  scale_fill_manual(name='Type', values=c('#E69F00','#56B4E9','#009E73')) +
  geom_bar(stat="identity", position='dodge')
## Create samples
if (!file.exists('samples/blog.txt')) {
  dir.create('samples', showWarnings = FALSE)  # make sure the sample folder exists
  set.seed(23)
  blogs   <- blogs_all[sample(1:length(blogs_all),32000)]
  news    <- news_all[sample(1:length(news_all),32000)]
  twitter <- twitter_all[sample(1:length(twitter_all),32000)]

  # Save Samples
  writeLines(blogs,   'samples/blog.txt')
  writeLines(news,    'samples/news.txt')
  writeLines(twitter, 'samples/twitter.txt')
  
  rm(blogs_all)
  rm(news_all)
  rm(twitter_all)
} else {
  blogs       <- readLines('samples/blog.txt')
  news        <- readLines('samples/news.txt')
  twitter     <- readLines('samples/twitter.txt')
}

## Create corpus and clean data
docs <- Corpus(DirSource('samples'))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, PlainTextDocument)

## Construct DocumentTermMatrix
dtm  <- DocumentTermMatrix(docs)

## Load bad words, check for bad words in the documents and remove them.
tmp1 <- readLines('http://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en')
tmp2 <- readLines('http://gist.githubusercontent.com/ryanlewis/a37739d710ccdb4b406d/raw/0fbd315eb2900bb736609ea894b9bde8217b991a/google_twunter_lol')
bad_words <- c(tmp1,tmp2)
bad_words <-unique(bad_words)
rm(tmp1,tmp2)

bw_dtm <- dtm[,intersect(colnames(dtm), bad_words)]
kable(as.data.frame(t(100*rowSums(as.matrix(bw_dtm))/rowSums(as.matrix(dtm))),
                    row.names = c("Relative frequencies")),
      caption="Profane words in percent of all words.",
      col.names = c('Blogs','News','Twitter'),
      digits = 2)

bw=names(head(sort(colSums(as.matrix(bw_dtm)),decreasing = TRUE),10))
my_str=bw[1]
for (i in 2:9) my_str = paste(my_str,bw[i],sep=", ")
my_str = paste(my_str,bw[10],sep=", and ")

dtm = dtm[,-which(colnames(dtm) %in% bad_words)]

## Create Zipf plot
Zipf_plot(dtm,type='b',col='#56B4E9',lwd = 2, main='Zipf Plot')
## Create Heaps plot
Heaps_plot(dtm,type='b',col='#009E73',lwd = 3, main='Heaps Plot')
## Create word clouds
par(mfrow = c(1, 3), bg = "#EEEEEE") 
headings = c("Blogs", "News", "Twitter")

for (i in 1:length(docs)) {
    wordcloud(words = colnames(as.matrix(dtm)), freq = as.matrix(dtm[i,]), 
        scale = c(3, 1), max.words = 57, use.r.layout=FALSE, 
        random.order = T, rot.per = 0.23, colors =  c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"))
    title(headings[i])
}
## Create Plot function
myPlot <- function(df, title, ylab ){
    maxCount = max(df$Count)
    minCount = min(df$Count)
    myCol=character(0)
    for (i in 1:19) myCol = c(myCol,rainbow(maxCount)[(df$Count[i]-minCount)*maxCount/(maxCount-minCount)])
    myCol[20]=rainbow(maxCount)[1]
   
    ggplot(df, aes(x = Word, y = Count, fill = factor(1:20))) +
        geom_bar(width = 0.1, stat = 'identity', aes(colour = Count)) +
        geom_point(size = 7, aes(colour = Count)) +
        scale_colour_gradientn(colors=rainbow(maxCount)) +
        scale_fill_manual(values=myCol) +
        coord_flip() +
        theme(axis.text = element_text(size = 15)) +
        xlab("") +
        ylab(ylab) +
        ggtitle(title) +
        theme(legend.position="none")
}


## Find most frequent words
tmp=head(sort(colSums(as.matrix(dtm)),decreasing = TRUE),20)
df=data.frame(Word=names(tmp),Count=tmp)
df$Word <- factor(df$Word, levels = df$Word[order(df$Count)])
myPlot(df,"20 Most Frequent words", "Word Count")
## Plot frequencies of word pairs
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
bigram_dtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
tmp2=head(sort(colSums(as.matrix(bigram_dtm)),decreasing = TRUE),20)
df2=data.frame(Word=names(tmp2),Count=tmp2)
df2$Word <- factor(df2$Word, levels = df2$Word[order(df2$Count)])

myPlot(df2,"20 Most Frequent Words Pairs", "Word Pair Count")
## Plot frequencies of 3-grams
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
trigram_dtm <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
tmp3=head(sort(colSums(as.matrix(trigram_dtm)),decreasing = TRUE),20)
df3=data.frame(Word=names(tmp3),Count=tmp3)
df3$Word <- factor(df3$Word, levels = df3$Word[order(df3$Count)])

myPlot(df3,"20 Most Frequent 3-grams", "3-gram Count")
## Plot frequencies of 4-grams
FourgramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))}
fourgram_dtm <- DocumentTermMatrix(docs, control = list(tokenize = FourgramTokenizer))
tmp4=head(sort(colSums(as.matrix(fourgram_dtm)),decreasing = TRUE),20)
df4=data.frame(Word=names(tmp4),Count=tmp4)
df4$Word <- factor(df4$Word, levels = df4$Word[order(df4$Count)])

myPlot(df4,"20 Most Frequent 4-grams", "4-gram Count")
## Print R code

## Print Session info
sessionInfo()

Session Info

## R version 3.1.3 (2015-03-09)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252 
## [2] LC_CTYPE=Norwegian (Bokmål)_Norway.1252   
## [3] LC_MONETARY=Norwegian (Bokmål)_Norway.1252
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Norwegian (Bokmål)_Norway.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] doParallel_1.0.10  iterators_1.0.8    foreach_1.4.3     
##  [4] reshape2_1.4.1     ggplot2_2.1.0      slam_0.1-32       
##  [7] wordcloud_2.5      RColorBrewer_1.1-2 tm_0.6-2          
## [10] NLP_0.1-9          knitr_1.12.3       stringi_1.0-1     
## 
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-10   colorspace_1.2-6   digest_0.6.9      
##  [4] evaluate_0.8.3     formatR_1.3        grid_3.1.3        
##  [7] gtable_0.2.0       highr_0.5.1        htmltools_0.3     
## [10] labeling_0.3       magrittr_1.5       munsell_0.4.3     
## [13] plyr_1.8.3         Rcpp_0.12.3        rJava_0.9-8       
## [16] rmarkdown_0.9.5    RWeka_0.4-25       RWekajars_3.7.13-1
## [19] scales_0.4.0       stringr_1.0.0      tools_3.1.3       
## [22] yaml_2.1.13