This is an exploratory analysis of text data obtained from HC Corpora. The final goal is to create a language model that can predict the next word given an input phrase with multiple words.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details.
This analysis covers the following:
Data loading
Data cleanup
Summary statistics
Plots to explore key features
Observations
Plan for the language model
The corpus contains data for four languages. This exploratory analysis is based on the US English data (en_US) and covers the following three files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
The following sections cover the environment setup and data loading.
options(rpubs.upload.method = "internal")
# Load libraries
library ("tm") # text mining
library ("RWeka") # text mining
library ("RWekajars") # text mining
library ("sqldf") # sql
library ("stringr") # working with strings
library ("ggplot2") # plotting
# Functions
# Function to sample lines from a source
SampleLines = function (LinesTmp, Pct = 0.05) {
  # Randomly sample Pct of the lines without replacement
  NLines = length (LinesTmp)
  LinesTmp = LinesTmp[sample (NLines, size = (NLines * Pct), replace = F)]
  return (LinesTmp)
}
# Function to count words and lines
CountWL = function (DataType, SampleData) {
  # Approximate word count: number of non-word character runs per line, plus one
  W = sum (sapply (gregexpr ("\\W+", SampleData), length) + 1)
  L = length (SampleData)
  Avg = round (W / L)
  Summary = paste (DataType, ": Number of Lines -", L, "; Number of Words -", W,
                   "; Avg Number of Words per Line -", Avg)
  return (Summary)
}
# Function to create n-gram tokens in small batches
# (processing small chunks avoids the Java "GC overhead limit exceeded" error)
BuildTokens = function (Lines, ChunkSize = 2000, Min = 1, Max = 1, Delimit = " ") {
  Tokens = character ()
  NumLines = length (Lines)
  i = 1
  while (i <= NumLines) {
    j = min (i + ChunkSize, NumLines)   # end of the current chunk
    print (paste (i, j))                # progress indicator
    TokensTmp = NGramTokenizer (Lines[i:j],
                                Weka_control (min = Min, max = Max, delimiters = Delimit))
    Tokens = c (Tokens, TokensTmp)
    i = j + 1                           # move to the start of the next chunk
  }
  return (Tokens)
}
# Load data
# Files
SourceFileBlog = "./final/en_US/en_US.blogs.txt"
SourceFileNews = "./final/en_US/en_US.news.txt"
SourceFileTwitter = "./final/en_US/en_US.twitter.txt"
N = -1 # number of lines to load. -1 for all lines
LinesBlog = readLines (con = SourceFileBlog, n = N)
LinesNews = readLines (con = SourceFileNews, n = N)
LinesTwitter = readLines (con = SourceFileTwitter, n = N)
SamplePct = 0.1 # fraction of lines to sample from each source (10%)
set.seed (123)
# Sample from original files
SampleBlog = SampleLines (LinesBlog, SamplePct)
SampleNews = SampleLines (LinesNews, SamplePct)
SampleTwitter = SampleLines (LinesTwitter, SamplePct)
# consolidate all samples
Lines = c (SampleBlog, SampleNews, SampleTwitter)
Here are some summary counts of the raw data.
# Count of words and lines of the raw data before any cleanup
CountWL ("Blog Data ", LinesBlog)
## [1] "Blog Data : Number of Lines - 899288 ; Number of Words - 39386844 ; Avg Number of Words per Line - 44"
CountWL ("News Data ", LinesNews)
## [1] "News Data : Number of Lines - 77259 ; Number of Words - 2837489 ; Avg Number of Words per Line - 37"
CountWL ("Twitter Data", LinesTwitter)
## [1] "Twitter Data : Number of Lines - 2360148 ; Number of Words - 32874008 ; Avg Number of Words per Line - 14"
Before proceeding further, the data needs to be cleaned of unwanted characters. A quick look at the data reveals digits, punctuation and control characters that are unlikely to help with the final goal of predicting the next word.
# Look for digits, control and punctuation characters
grep("[[:digit:]]", Lines, value = T)[1:5]
grep("[[:cntrl:]]", Lines, value = T)[1:5]
grep("[[:punct:]]", Lines, value = T)[1:5]
These characters are removed from the sampled data before we proceed to the next step of creating tokens for further exploration. All text is also converted to lower case so that identical words are consolidated. Swear words are not removed at this point, since they might add some value in predicting the next word; they will be handled in the final language model.
Lines = gsub ("[[:digit:]]", "", Lines)
Lines = gsub ("[[:punct:]]", "", Lines)
Lines = gsub ("[[:cntrl:]]", "", Lines)
Lines = tolower (Lines)
The next step is tokenization, the process of breaking a stream of text into words, phrases, symbols or other meaningful elements. Unigram and bigram (one- and two-word) tokens are created for this analysis.
# Unigram tokens
TokensUg = BuildTokens (Lines, ChunkSize = 2000, Min = 1, Max = 1, Delimit = " ")
# Frequency of unigram tokens
FreqUg = as.data.frame (table (TokensUg), stringsAsFactors = F)
names (FreqUg) = c ("Token", "Freq")
# Sort the tokens by frequency
FreqUg = FreqUg[order(FreqUg$Freq, decreasing = T),]
# Categorize frequencies into buckets for further analysis
q = c (0, 10, 100, 1000, 10000, 100000, 1000000)
FreqUg$Bucket = cut (FreqUg[, "Freq"], q,
include.lowest = T, dig.lab = 6)
# Bigram tokens
TokensBg = BuildTokens (Lines, ChunkSize = 2000, Min = 2, Max = 2, Delimit = " ")
# Frequency of bigram tokens
FreqBg = as.data.frame (table (TokensBg), stringsAsFactors = F)
names (FreqBg) = c ("Token", "Freq")
FreqBg = FreqBg[order(FreqBg$Freq, decreasing = T),]
# Total number of unigram tokens in the sample
print (paste ("Total number of unigram tokens =", length (TokensUg)))
## [1] "Total number of unigram tokens = 6890018"
# Coverage
CovSample = round (sum (FreqUg[1:round(nrow(FreqUg)/2), "Freq"]) / length (TokensUg) * 100)
print (paste ("50% of the vocabulary covers ", CovSample, "% of the data sample"))
## [1] "50% of the vocabulary covers 99 % of the data sample"
Top 20 most frequently used words:
print (FreqUg[1:20, c("Token", "Freq")], row.names = F)
## Token Freq
## the 294266
## to 191954
## and 158430
## a 156774
## i 149295
## of 129598
## in 101987
## you 84782
## is 81447
## for 77566
## that 71937
## it 70566
## on 56692
## my 56106
## with 47833
## this 42648
## was 41227
## be 40488
## have 39696
## at 37693
Histogram of word frequencies, showing the extent of sparse words in the corpus:
qplot (Bucket, data = FreqUg, geom = "bar")
Bar chart of the top 20 bigrams:
FreqWordsHist = ggplot(FreqBg[1:20,], aes (x = Token, y = Freq))
FreqWordsHist = FreqWordsHist +
geom_bar (stat = "identity", fill = "black")
FreqWordsHist = FreqWordsHist + theme (axis.text.x = element_text (angle = 45, hjust = 1))
FreqWordsHist
Key observations:
The data contains digits, control and punctuation characters that may not add much predictive value.
The most frequent 50% of the vocabulary covers nearly all of the sampled data.
There are many sparse tokens that may not add much predictive value.
The plan for building the language model is as follows:
Load the raw data and clean up unwanted characters.
Sample a reasonable percentage of the data (10%-50%).
Build n-gram tokens for n = 1 to 4.
Remove sparse tokens.
Build a language model using the maximum likelihood estimates of the n-gram probabilities. This will be a lookup table keyed on the (n-1)-word prefix, mapping to the predicted next word with the highest probability. If the predicted word is a swear word, the word with the next highest probability is used instead. A rough sketch of this idea, using the bigram counts from above, follows this list.
Given an input phrase, the same cleanup rules described above are applied and the last n (= 3) words are used to look up a prediction in the language model. If a match is found, it is returned; if not, a back-off strategy looks up the last n-1 words, then n-2 words, and so on, until a match is found.
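As a rough sketch of the last two plan items, the fragment below builds a bigram maximum-likelihood lookup table from the FreqBg and FreqUg data frames created earlier and backs off to the most frequent unigram when the prefix is unseen. It is only an illustration of the intended design, not code that was run for this analysis: the column names Prefix, NextWord and Prob, the objects PrefixTotals and BigramLookup, and the function PredictNext are placeholders introduced here, and the final model would use 3- and 4-gram keys and filter swear words.
# Rough sketch: bigram MLE lookup table with a simple unigram back-off
# Split each bigram "w1 w2" into a prefix (w1) and a candidate next word (w2)
Parts = strsplit (FreqBg$Token, " ", fixed = TRUE)
FreqBg$Prefix = sapply (Parts, "[", 1)
FreqBg$NextWord = sapply (Parts, "[", 2)
# Maximum likelihood estimate: P(w2 | w1) = Count(w1 w2) / Count(w1)
PrefixTotals = tapply (FreqBg$Freq, FreqBg$Prefix, sum)
FreqBg$Prob = FreqBg$Freq / PrefixTotals[FreqBg$Prefix]
# Keep only the highest-probability next word for each prefix
FreqBg = FreqBg[order (FreqBg$Prefix, -FreqBg$Prob), ]
BigramLookup = FreqBg[!duplicated (FreqBg$Prefix), c ("Prefix", "NextWord", "Prob")]
# Predict the next word for an input phrase; back off to the most frequent
# unigram when the last word of the phrase is not in the lookup table
PredictNext = function (Phrase) {
  Phrase = tolower (gsub ("[[:digit:][:punct:][:cntrl:]]", "", Phrase))
  Words = strsplit (trimws (Phrase), "\\s+")[[1]]
  LastWord = tail (Words, 1)
  Hit = BigramLookup[BigramLookup$Prefix == LastWord, "NextWord"]
  if (length (Hit) > 0) return (Hit[1])
  return (FreqUg$Token[1]) # FreqUg is already sorted by decreasing frequency
}
PredictNext ("thanks for the")
The same table structure extends to higher-order n-grams: the key simply becomes the last two or three words of the phrase instead of one.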
Other things to consider to improve this model:
Instead of storing only the next word with the highest probability, store a few other likely candidates as well (see the sketch after this list).
Include associations between phrases in the model and use them during lookup.
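As an illustration of the first idea, and building on the sketch above (it reuses the Prefix, NextWord and Prob columns added there), the lookup table could keep the top few candidates per prefix instead of a single word. The cutoff TopK = 3 and the name BigramLookupK are arbitrary choices for illustration.
# Keep the top TopK next-word candidates for each prefix instead of only one
TopK = 3
FreqBg = FreqBg[order (FreqBg$Prefix, -FreqBg$Prob), ]
FreqBg$Rank = ave (FreqBg$Prob, FreqBg$Prefix, FUN = seq_along)
BigramLookupK = FreqBg[FreqBg$Rank <= TopK, c ("Prefix", "NextWord", "Prob")]
head (BigramLookupK)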