Executive Summary

For the twitter data-set the word-to-sentence ratio is similar to the word-to-line ratio, but for the news and blog data-sets (en_US) the two ratios differ significantly. This influences the approach taken to create the training data-sets used to generate and test the n-gram data-models that will underpin a next-word prediction tool.

Introduction

The goal of this Milestone Report submission is to provide a brief and concise summary of the data analysis carried out during the initial stages of the project.

An overview of how the data was obtained and the initial summary statistics for the dataset are shared, along with an initial plan for the creation of the prediction algorithm.

Data Exploration

Data Source and Strategy

The test dataset was obtained directly from Coursera and has already been language-filtered. The original data comes from a corpus called HC Corpora (http://www.corpora.heliohost.org), which was collected by web-crawler programs.

When analysing the files I am prioritising looking at the data as sentences rather than as lines. My reasoning is that I am looking to predict the next word of a sentence and never the starting word of a sentence, so it does not make sense at this stage to consider the ends and starts of adjacent sentences together. Investigating the average number of words per sentence, rather than per line, should highlight the following (a minimal sketch follows the list below):-

  • Sentence length differences between blogs, news and twitter (the expectation is that length decreases in that order)
  • Word length differences between data sources (e.g. due to tweet-length restrictions, words are more likely to be shorter and not in the English dictionary, e.g. l8r = later)
  • Grammar is expected to be lacking in twitter compared to a written and reviewed blog or news bulletin (out of scope of this study)
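
As a minimal sketch of this sentence-versus-line distinction (the example.line value below is a made-up line, not taken from the corpus), the same line of text gives quite different word averages depending on whether it is treated as one sequence or split at sentence terminators:

example.line <- "i went to the shop. it was shut. maybe l8r?"    # hypothetical line containing three sentences
# Words per line: treat the whole line as a single sequence
line.words <- unlist(strsplit(example.line, "\\W+", perl=TRUE))
length(line.words)                                                # 10 words in the line
# Words per sentence: split at '.' and '?' first, then average the counts
line.sentences <- unlist(strsplit(example.line, "[.?]+\\s*", perl=TRUE))
mean(lengths(strsplit(line.sentences, "\\W+", perl=TRUE)))        # ~3.3 words per sentence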

The provided data files contain non-ASCII characters and must be read in binary mode (open = "rb"); otherwise, on Windows, the full dataset is not read and only an incomplete view is obtained.

Reproducible Discovery

start.time <- Sys.time()
library(stringr)
data_directory <- "../data/raw/en_US/"
save_directory <- "../data/clean/en_US/"
files <- list.files( data_directory, pattern = "\\.txt$", recursive=TRUE)
summary_table <- NULL
#
for (filename in files) {
  # Open in binary mode ("rb") so non-ASCII bytes do not truncate the read on Windows
  fin <- file(paste0(data_directory,filename), open="rb",  encoding="UTF-8" )
  data.text <- readLines(fin, n=-1, encoding="UTF-8", warn=FALSE) 
  close(fin)
  # Replace accented characters
  data.text <- iconv(data.text, to='ASCII//TRANSLIT')
  # convert to lowercase
  data.text <- tolower(data.text)
  # Approximate sentence count by counting the '.' and '?' sentence terminators
  sentencecnt <- sum(str_count(data.text,"[.?]"))
  linecnt <- length(data.text)
  #Split the strings into separate words
  data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
  data.words.vector <- unlist(data.words.list)
  wordcnt <- length(data.words.vector)
  ##List each word and frequency (top-10)
  data.freq.list <- sort( table(data.words.vector) , decreasing=TRUE)
  print( paste("Top 10 words for data-set",filename))
  print( data.freq.list[1:10]) # Top 10 common words
  ##List frequency distribution of word length
  data.freq.wordlength <- table( unlist( lapply( data.words.list , nchar) ) )
  print(paste("Most popular word length for",filename,"is:",names(data.freq.wordlength)[which.max(data.freq.wordlength)]) )
  #
  title.text <- paste('Word length frequency', str_replace(filename,"en_US.",""))
  barplot( data.freq.wordlength[data.freq.wordlength>100] , 
           col=rainbow(16), main=title.text, xlab="Word size",ylab="Frequency",xlim=c(1,15),
           cex.axis=0.6, cex.names=0.6, cex.main=0.8)
  #
  summary_table <- rbind(summary_table , 
    cbind(filename, sentencecnt, linecnt, 
          wordsentenceratio=round( wordcnt / sentencecnt , 2), 
          wordlineratio=round( wordcnt / linecnt , 2) ) )
  # Preserve object
  #save(data.text,file=paste0(save_directory,filename,".RData"),compress=TRUE,compression_level=9 )
}
## [1] "Top 10 words for data-set en_US.blogs.txt"
## data.words.vector
##     the       a     and      to      of       i      in    that      it 
## 1854668 1174753 1093421 1068774  876535  846674  597655  471084  443575 
##      is 
##  431813 
## [1] "Most popular word length for en_US.blogs.txt is: 3"

## [1] "Top 10 words for data-set en_US.news.txt"
## data.words.vector
##     the       a      to     and      of      in       s    that     for 
## 1971936 1059007  905973  889072  774469  678854  386394  367727  353763 
##      is 
##  284161 
## [1] "Most popular word length for en_US.news.txt is: 3"

## [1] "Top 10 words for data-set en_US.twitter.txt"
## data.words.vector
##    the      i     to      a    you    and    for     it     in     of 
## 937298 921720 788788 674361 599859 438658 385397 382222 380498 359670 
## [1] "Most popular word length for en_US.twitter.txt is: 4"

This highlights the contextual differences between the data sources and confirms that these most frequent words, which are fairly consistent across the top-10 lists of all three sources, are the words that would need to be removed prior to building the model (i.e. stopwords). Unfortunately, the numbers do not provide a clear indication of a suitable value of 'n' for the n-grams.
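
As a rough sketch of this stopword removal step (the stopword vector below is illustrative only, taken from the top-10 tables above; the final list would come from the chosen NLP library):

# Illustrative stopword list based on the top-10 tables above (not the final list)
stopwords <- c("the", "a", "and", "to", "of", "i", "in", "that", "it", "is", "you", "for")
# Remove stopwords from the flat word vector produced in the loop above
data.words.filtered <- data.words.vector[ !data.words.vector %in% stopwords ]
# Top-10 words once the stopwords are removed
sort( table(data.words.filtered), decreasing=TRUE )[1:10]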

Data Summary

The table below compares the three file types, with blogs and news showing a significant difference between their word-to-line and word-to-sentence ratios. The word-to-sentence ratios of roughly 10 to 14 suggest that covering a whole sentence would require an n-gram of order 10 or more, which is not practical to create, store and use; however, this does support feeding individual sentences into the NLP library rather than supplying an entire line of text as a single sequence (a brief sketch of this sentence-based approach follows the table).

summary_table
##      filename            sentencecnt linecnt   wordsentenceratio
## [1,] "en_US.blogs.txt"   "3213456"   "899288"  "12.03"          
## [2,] "en_US.news.txt"    "2523440"   "1010242" "14.24"          
## [3,] "en_US.twitter.txt" "3185329"   "2360148" "9.81"           
##      wordlineratio
## [1,] "42.99"      
## [2,] "35.57"      
## [3,] "13.25"
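
A minimal sketch of this sentence-based approach, assuming data.text holds a cleaned data-set as produced in the loop above; sentences are split out first and bigrams are counted within each sentence only, so no bigram spans a sentence boundary. The final implementation will depend on the NLP library selected below.

# Split each cleaned line into sentences, then build bigrams within each sentence
sentence.vector <- unlist(strsplit(data.text, "[.?]+\\s*", perl=TRUE))
bigrams <- unlist(lapply(strsplit(sentence.vector, "\\W+", perl=TRUE), function(w) {
  w <- w[w != ""]                        # drop empty tokens left by leading punctuation
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))        # adjacent word pairs within the sentence
}))
# The most frequent bigrams would form the basis of a next-word lookup
sort( table(bigrams), decreasing=TRUE )[1:10]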

Plans for creating a prediction algorithm

My planned next steps for this project are:-

  1. Investigate and select the most suitable NLP library (model size is a key requirement)
  2. Create sample data-sets with an inclusion probability of 0.01: based on a population of roughly 4.2 million rows, a margin of error of 2% and a confidence level of 99%, the required sample size is approximately 4,143 rows, i.e. an inclusion probability of about 0.001, which is rounded up to 0.01 (1%) for safety (see the sampling sketch after this list)
  3. Clean the data-set of language accents, as these are not used consistently (and this is an en_US set)
  4. Substitute profane words (see references): my theory is that removing them outright would change the context and structure of the sentence, so they will instead be replaced with a token (e.g. profanity), as illustrated in the sketch after this list. This can be tested.
  5. Decide whether words separated by '/' should be split into two words or have the second word removed (strategy to be tested)
  6. Build and experiment with the size and performance of different n-gram models (research papers suggest significant gains as n increases from 1 to 3, with smaller improvements at a higher processing cost above this threshold)
  7. Understand the footprint limitations of the Shiny app deployment (e.g. memory requirements)
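
A minimal sketch of steps 2 and 4, assuming data.text holds a cleaned data-set as produced above and using a purely illustrative profanity list (the real list will come from the reference material):

set.seed(1234)                           # make the 1% sample reproducible
# Step 2: keep each line with probability 0.01, comfortably above the ~0.001
# minimum implied by a required sample of ~4,143 lines out of ~4.2 million
keep <- rbinom(length(data.text), size=1, prob=0.01) == 1
data.sample <- data.text[keep]
# Step 4: replace profane words with a token rather than deleting them, so the
# sentence structure is preserved (the word list here is purely illustrative)
profane.words <- c("badword1", "badword2")
profane.pattern <- paste0("\\b(", paste(profane.words, collapse="|"), ")\\b")
data.sample <- gsub(profane.pattern, "profanity", data.sample, perl=TRUE)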

References

Processing time: 5.2327428