This is an exploratory analysis of text data obtained from HC Corpora. The final goal is to create a language model that can predict the next word given an input phrase with multiple words.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details.
This analysis covers the following:
Data loading
Data cleanup
Summary statistics
Plots to explore key features
Observations
Plan for the language model
The corpus contains data for four languages. This exploratory analysis is based on the US English data (en_US) and covers the following three files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
The following sections cover the environment setup and data loading.
options(rpubs.upload.method = "internal")
# Load libraries
library ("tm") # text mining
library ("RWeka") # text mining
library ("RWekajars") # text mining
library ("sqldf") # sql
library ("stringr") # working with strings
library ("ggplot2") # plotting
# Functions
# Function to sample lines from a source
SampleLines = function (LinesTmp, Pct = 0.05) {
  # Randomly sample Pct of the lines without replacement
  NLines = length (LinesTmp)
  LinesTmp = LinesTmp[sample (NLines, size = (NLines * Pct), replace = F)]
  return (LinesTmp)
}
# Function to count words and lines
CountWL = function (DataType, SampleData) {
  # Approximate word count: number of non-word character runs per line, plus one
  W = sum (sapply (gregexpr ("\\W+", SampleData), length) + 1)
  L = length (SampleData)
  Avg = round (W / L)
  Summary = paste (DataType, ": Number of Lines -", L, "; Number of Words -", W,
                   "; Avg Number of Words per Line -", Avg)
  return (Summary)
}
# Function to create n-gram tokens in small batches
# (processing small chunks avoids the Java "GC overhead limit exceeded" error)
BuildTokens = function (Lines, ChunkSize = 2000, Min = 1, Max = 1, Delimit = " ") {
  Tokens = character ()
  NumLines = length (Lines)
  i = 1
  while (i <= NumLines) {
    j = min (i + ChunkSize, NumLines)   # end of the current chunk
    print (paste (i, j))                # progress indicator
    TokensTmp = NGramTokenizer (Lines[i:j],
                                Weka_control (min = Min, max = Max, delimiters = Delimit))
    Tokens = c (Tokens, TokensTmp)
    i = j + 1                           # move to the start of the next chunk
  }
  return (Tokens)
}
# Load data
# Files
SourceFileBlog = "./final/en_US/en_US.blogs.txt"
SourceFileNews = "./final/en_US/en_US.news.txt"
SourceFileTwitter = "./final/en_US/en_US.twitter.txt"
N = -1 # number of lines to load. -1 for all lines
LinesBlog = readLines (con = SourceFileBlog, n = N)
LinesNews = readLines (con = SourceFileNews, n = N)
LinesTwitter = readLines (con = SourceFileTwitter, n = N)
SamplePct = 0.1 # fraction of lines to sample from each source (10%)
set.seed (123)
# Sample from original files
SampleBlog = SampleLines (LinesBlog, SamplePct)
SampleNews = SampleLines (LinesNews, SamplePct)
SampleTwitter = SampleLines (LinesTwitter, SamplePct)
# consolidate all samples
Lines = c (SampleBlog, SampleNews, SampleTwitter)
Here are some summary counts of the raw data.
# Count of words and lines of the raw data before any cleanup
CountWL ("Blog Data ", LinesBlog)
## [1] "Blog Data : Number of Lines - 899288 ; Number of Words - 39386844 ; Avg Number of Words per Line - 44"
CountWL ("News Data ", LinesNews)
## [1] "News Data : Number of Lines - 77259 ; Number of Words - 2837489 ; Avg Number of Words per Line - 37"
CountWL ("Twitter Data", LinesTwitter)
## [1] "Twitter Data : Number of Lines - 2360148 ; Number of Words - 32874008 ; Avg Number of Words per Line - 14"
Before proceeding further, the data needs to be cleaned of unwanted characters. A quick look at the data reveals digits, punctuation and control characters that are unlikely to help with the final goal of predicting the next word.
# Look for digits, control and punctuation characters
grep("[[:digit:]]", Lines, value = T)[1:5]
grep("[[:cntrl:]]", Lines, value = T)[1:5]
grep("[[:punct:]]", Lines, value = T)[1:5]
These characters are removed from the sampled data before we proceed to the next step of creating tokens for further exploration. All text is also converted to lower case so that identical words are consolidated. Swear words are not removed at this point, since they might add some value in predicting the next word; they will be handled in the final language model.
Lines = gsub ("[[:digit:]]", "", Lines)
Lines = gsub ("[[:punct:]]", "", Lines)
Lines = gsub ("[[:cntrl:]]", "", Lines)
Lines = tolower (Lines)
The next step is tokenization, the process of breaking a stream of text into words, phrases, symbols or other meaningful elements. Unigram and bigram (one- and two-word) tokens are created for this analysis.
# Unigram tokens
TokensUg = BuildTokens (Lines, ChunkSize = 2000, Min = 1, Max = 1, Delimit = " ")
# Frequency of unigram tokens
FreqUg = as.data.frame (table (TokensUg), stringsAsFactors = F)
names (FreqUg) = c ("Token", "Freq")
# Sort the tokens by frequency
FreqUg = FreqUg[order(FreqUg$Freq, decreasing = T),]
# Categorize frequencies into buckets for further analysis
q = c (0, 10, 100, 1000, 10000, 100000, 1000000)
FreqUg$Bucket = cut (FreqUg[, "Freq"], q,
include.lowest = T, dig.lab = 6)
# Bigram tokens
TokensBg = BuildTokens (Lines, ChunkSize = 2000, Min = 2, Max = 2, Delimit = " ")
# Frequency of bigram tokens
FreqBg = as.data.frame (table (TokensBg), stringsAsFactors = F)
names (FreqBg) = c ("Token", "Freq")
FreqBg = FreqBg[order(FreqBg$Freq, decreasing = T),]
# Total number of unigram tokens in the sample
print (paste ("Total number of unigram tokens =", length (TokensUg)))
## [1] "Total number of unigram tokens = 6890018"
# Coverage
CovSample = round (sum (FreqUg[1:round(nrow(FreqUg)/2), "Freq"]) / length (TokensUg) * 100)
print (paste ("50% of the vocabulary covers ", CovSample, "% of the data sample"))
## [1] "50% of the vocabulary covers 99 % of the data sample"
Top 20 most frequently used words:
print (FreqUg[1:20, c("Token", "Freq")], row.names = F)
## Token Freq
## the 294266
## to 191954
## and 158430
## a 156774
## i 149295
## of 129598
## in 101987
## you 84782
## is 81447
## for 77566
## that 71937
## it 70566
## on 56692
## my 56106
## with 47833
## this 42648
## was 41227
## be 40488
## have 39696
## at 37693
Histogram of word frequencies, showing the extent of sparse words in the corpus:
qplot (Bucket, data = FreqUg, geom = "bar")
Bar chart of the top 20 bigrams:
FreqWordsHist = ggplot(FreqBg[1:20,], aes (x = Token, y = Freq))
FreqWordsHist = FreqWordsHist +
geom_bar (stat = "identity", fill = "black")
FreqWordsHist = FreqWordsHist + theme (axis.text.x = element_text (angle = 45, hjust = 1))
FreqWordsHist
Key observations:
The data contains digits, control and punctuation characters that may not add much predictive value.
The most frequent 50% of the vocabulary covers nearly all of the sampled data.
There are many sparse tokens that may not add much predictive value.
The plan for building the language model is as follows:
Load the raw data and clean up unwanted characters.
Sample a reasonable percentage of the data (10%-50%).
Build n-gram tokens for n = 1 to 4.
Remove sparse tokens.
Build a language model using the maximum likelihood estimates of the n-gram probabilities. This will be a lookup table keyed on the (n-1)-word prefix, mapping to the predicted next word with the highest probability. If the predicted word is a swear word, the word with the next highest probability is used instead. A rough sketch of this idea, using the bigram counts from above, follows this list.
Given an input phrase, the same cleanup rules described above are applied and the last n (= 3) words are used to look up a prediction in the language model. If a match is found, it is returned; if not, a back-off strategy looks up the last n-1 words, then n-2 words, and so on, until a match is found.
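As a rough sketch of the last two plan items, the fragment below builds a bigram maximum-likelihood lookup table from the FreqBg and FreqUg data frames created earlier and backs off to the most frequent unigram when the prefix is unseen. It is only an illustration of the intended design, not code that was run for this analysis: the column names Prefix, NextWord and Prob, the objects PrefixTotals and BigramLookup, and the function PredictNext are placeholders introduced here, and the final model would use 3- and 4-gram keys and filter swear words.
# Rough sketch: bigram MLE lookup table with a simple unigram back-off
# Split each bigram "w1 w2" into a prefix (w1) and a candidate next word (w2)
Parts = strsplit (FreqBg$Token, " ", fixed = TRUE)
FreqBg$Prefix = sapply (Parts, "[", 1)
FreqBg$NextWord = sapply (Parts, "[", 2)
# Maximum likelihood estimate: P(w2 | w1) = Count(w1 w2) / Count(w1)
PrefixTotals = tapply (FreqBg$Freq, FreqBg$Prefix, sum)
FreqBg$Prob = FreqBg$Freq / PrefixTotals[FreqBg$Prefix]
# Keep only the highest-probability next word for each prefix
FreqBg = FreqBg[order (FreqBg$Prefix, -FreqBg$Prob), ]
BigramLookup = FreqBg[!duplicated (FreqBg$Prefix), c ("Prefix", "NextWord", "Prob")]
# Predict the next word for an input phrase; back off to the most frequent
# unigram when the last word of the phrase is not in the lookup table
PredictNext = function (Phrase) {
  Phrase = tolower (gsub ("[[:digit:][:punct:][:cntrl:]]", "", Phrase))
  Words = strsplit (trimws (Phrase), "\\s+")[[1]]
  LastWord = tail (Words, 1)
  Hit = BigramLookup[BigramLookup$Prefix == LastWord, "NextWord"]
  if (length (Hit) > 0) return (Hit[1])
  return (FreqUg$Token[1]) # FreqUg is already sorted by decreasing frequency
}
PredictNext ("thanks for the")
The same table structure extends to higher-order n-grams: the key simply becomes the last two or three words of the phrase instead of one.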
Other things to consider to improve this model:
Instead of storing only the next word with the highest probability, store a few other likely candidates as well (see the sketch after this list).
Include associations between phrases in the model and use them during lookup.
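As an illustration of the first idea, and building on the sketch above (it reuses the Prefix, NextWord and Prob columns added there), the lookup table could keep the top few candidates per prefix instead of a single word. The cutoff TopK = 3 and the name BigramLookupK are arbitrary choices for illustration.
# Keep the top TopK next-word candidates for each prefix instead of only one
TopK = 3
FreqBg = FreqBg[order (FreqBg$Prefix, -FreqBg$Prob), ]
FreqBg$Rank = ave (FreqBg$Prob, FreqBg$Prefix, FUN = seq_along)
BigramLookupK = FreqBg[FreqBg$Rank <= TopK, c ("Prefix", "NextWord", "Prob")]
head (BigramLookupK)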