Introduction

In Part 1, we partitioned the corpus data into training and test sets and performed a number of pre-processing steps to get this data ready for building n-gram frequency tables. The motivation for building these tables is that our language model needs them to make predictions. How these tables are used is described in detail in Predicting Next Word Using Katz Back-Off: Part 3 - Understanding the Katz Back-Off Model. In this document, we’ll focus on generating the unigram, bigram, and trigram tables and doing some exploratory data analysis (EDA) along the way to get a basic understanding of the data.

Note: All functions used in this analysis are defined in the Appendix unless stated otherwise.

Descriptive Statistics

It’s usually a good idea to begin any analysis involving a large number of elements (words, in this context) with some descriptive statistics. The functions used to generate the table below were run against the original, unprocessed corpus files to get the initial file sizes, line counts, word counts (all words), and word types (counts of unique words). The code is listed in the Appendix:

| File Name         | File Size (MB) | Line Count | Word Count (all words) | Word Types (unique words) |
|-------------------|----------------|------------|------------------------|---------------------------|
| en_US.blogs.txt   | 200.42         | 899,288    | 37,334,131             | 1,103,548                 |
| en_US.news.txt    | 196.28         | 1,010,242  | 34,372,530             | 876,772                   |
| en_US.twitter.txt | 159.36         | 2,360,148  | 30,373,584             | 1,290,172                 |
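
For reference, here is a minimal sketch of how the table above could be assembled from the Appendix functions (this assumes datDir, fnames, and the Appendix functions have already been defined and sourced):

# Sketch: build the descriptive statistics table from the Appendix functions
fileIds <- c("blogs", "news", "twitter")
corpus.stats <- data.frame(
    File.Name    = fnames,
    File.Size.MB = sapply(fnames, function(f) getFileSize(filename = f)),
    Line.Count   = sapply(fileIds, function(f) getLineCount(fileType = f)),
    Word.Count   = sapply(fileIds, getWordCount),
    Word.Types   = sapply(fileIds, getTypeCount)
)
corpus.stats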

Unigrams

With over 30M words in each corpus, we’d like to know which ones occur most frequently. To figure this out, we first build unigram frequency tables for each of the corpus files with the makeRawNgrams function, using its default settings. The inputs to this function were the last set of files from Part 1: en_US.blogs.train.8posteos.txt, en_US.news.train.8posteos.txt, and en_US.twitter.train.8posteos.txt.

The output files from this function were en_US.blogs.train.9rawunig.csv, en_US.news.train.9rawunig.csv, and en_US.twitter.train.9rawunig.csv.
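
Because makeRawNgrams defaults to n = 1 and the postfixes above, the call that produces these tables is a one-liner; the sketch below assumes ddir points at the directory holding the Part 1 output files:

# Sketch: build the raw unigram tables with the default settings
# (assumes ddir points to the directory containing the *.train.8posteos.txt files)
makeRawNgrams(table.dir = ddir)   # n = 1 by default; writes the *.train.9rawunig.csv files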

From the above three unigram frequency tables, the following charts were created using the getTopM_NgramsPlot function:

Note on Special Tokens

The NUM and EOS labels in the n-gram count charts are special tokens. The NUM token replaced all runs of digit characters in the corpus, and the EOS token marks the end of each sentence.
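
These substitutions were made during pre-processing in Part 1. The snippet below is only an illustration of the idea; the actual Part 1 code differs in detail:

# Illustration only: the rough idea behind the NUM and EOS tokens
x <- "I ran 5 miles on May 3. It felt great."
x <- gsub("[0-9]+", " NUM ", x)          # replace runs of digits with the NUM token
x <- gsub("[.!?]+(\\s|$)", " EOS ", x)   # mark sentence endings with the EOS token
x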

Unigram Singletons

Singletons are instances that occur only once in a corpus: unigram singletons are single words that occur only once, while bigram and trigram singletons are sequences of two and three words, respectively, that occur together only once. Why are singletons important? We can start to see the answer by looking at the three plots below, which were created using the getCountOfCountPlot function listed in the Appendix:

As we might expect, the number of unigrams that show up only once in a corpus (the point in the upper-left corner of each plot) is much higher than the count for unigrams appearing any greater number of times. What percentage of the unigrams in each corpus are singletons?

blogs.singletons <- getPercentSingletons("blogs")
blogs.singletons
## [1] 53.4556
news.singletons <- getPercentSingletons("news")
news.singletons
## [1] 47.34485
twitter.singletons <-  getPercentSingletons("twitter")
twitter.singletons
## [1] 59.14767

This tells us that roughly half of all unigrams are singletons. OK, so what does this mean for our prediction model? It has several implications. First, because the probability that a particular unigram singleton accurately completes a phrase is very low (on the order of 1 in 30M+), roughly half the rows in our initial en_US.*.train.9rawunig.csv unigram tables have very low predictive power. Second, any bigram or trigram containing a unigram singleton can itself occur only once, so it is also a singleton with low predictive value. Third, because a unigram singleton can appear in either the first or second position of a bigram, each singleton word can give rise to up to two bigram singletons. Fourth, by the same argument, each singleton word can give rise to up to three trigram singletons. In summary, if we include all the unigram singletons in our model, we’ll be carrying around a lot of data which adds very little to the power of our model.
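
These claims can be checked empirically. The sketch below is illustrative only: it assumes the raw unigram and bigram tables for one corpus have been read into data frames named unigrams and bigrams (each with ngram and freq columns) and that bigram tokens use quanteda's default underscore concatenator:

# Sketch: how many raw bigrams contain a unigram singleton?
# (assumes unigrams/bigrams data frames with ngram and freq columns)
singleton.words <- as.character(unigrams$ngram[unigrams$freq == 1])
bigram.parts <- strsplit(as.character(bigrams$ngram), "_", fixed = TRUE)
has.singleton <- sapply(bigram.parts, function(w) any(w %in% singleton.words))
mean(has.singleton)                       # proportion of bigrams touched by a singleton word
mean(bigrams$freq[has.singleton] == 1)    # these bigrams should themselves be singletons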

Space Constraints

Let’s compare the size of the n-gram tables which include the singletons with the size of those that have them removed. We’ve already generated the unigram tables which include the singletons: the en_US.*.train.9rawunig.csv files listed above. The raw bigrams and trigrams (which also include singletons) were created with the same inputs to makeRawNgrams, setting n = 2 for the bigrams and n = 3 for the trigrams and changing the output file name parameters accordingly. The nosins (no singletons) files were created using the removeSingletons function. The resulting output files and their sizes are shown below.

| File Name (with singletons)           | File Size (MB) | File Name (singletons removed)             | File Size (MB) | Difference (MB) |
|---------------------------------------|----------------|--------------------------------------------|----------------|-----------------|
| en_US.blogs.train.9rawunig.csv        | 4.32           | en_US.blogs.train.12unigrams.nosins.csv    | 1.92           | 2.39            |
| en_US.news.train.9rawunig.csv         | 4.09           | en_US.news.train.12unigrams.nosins.csv     | 2.07           | 2.02            |
| en_US.twitter.train.9rawunig.csv      | 3.74           | en_US.twitter.train.12unigrams.nosins.csv  | 1.45           | 2.29            |
| en_US.blogs.train.10rawbigrams.csv    | 88.99          | en_US.blogs.train.13bigrams.nosins.csv     | 24.29          | 64.7            |
| en_US.news.train.10rawbigrams.csv     | 94.5           | en_US.news.train.13bigrams.nosins.csv      | 27.26          | 67.24           |
| en_US.twitter.train.10rawbigrams.csv  | 55.72          | en_US.twitter.train.13bigrams.nosins.csv   | 14.48          | 41.24           |
| en_US.blogs.train.11rawtrigrams.csv   | 310.22         | en_US.blogs.train.14trigrams.nosins.csv    | 44.6           | 265.62          |
| en_US.news.train.11rawtrigrams.csv    | 318.9          | en_US.news.train.14trigrams.nosins.csv     | 48.77          | 270.12          |
| en_US.twitter.train.11rawtrigrams.csv | 169.14         | en_US.twitter.train.14trigrams.nosins.csv  | 24.56          | 144.58          |
| Totals                                | 1049.61        |                                            | 189.4          | 860.21          |

The current free plan on shinyapps.io limits deployments to 1 GB of memory for a large instance, which is quite generous. However, as the table above shows, the total size of the raw n-gram tables (the n-gram frequency tables which include the singletons) exceeds that limit. Fortunately, removing the singletons shrinks our nine n-gram tables from roughly 1 GB to under 200 MB, a reduction of more than 80%.
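
The removeSingletons function is not listed in the Appendix, but conceptually it just reads a raw n-gram table, drops the freq == 1 rows, and writes out the corresponding nosins file. A minimal sketch of that idea (an assumption, not the actual implementation) looks like this:

# Sketch only: the idea behind removeSingletons, not the actual function used
removeSingletonsSketch <- function(inPath, outPath) {
    ngrams <- read.csv(inPath)
    ngrams.nosins <- ngrams[ngrams$freq > 1, ]   # drop the singleton rows
    write.csv(ngrams.nosins, outPath, row.names = FALSE)
}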

Bigrams

Using either the raw or the singleton-free (nosins) bigram frequency tables listed above as inputs to the getTopM_NgramsPlot function, the 10 highest-frequency bigrams for each corpus are shown in Figures 7 through 9 below.

Trigrams

Using either the raw or the singleton-free (nosins) trigram frequency tables listed above as inputs to the getTopM_NgramsPlot function, the 10 highest-frequency trigrams for each corpus are shown in Figures 10 through 12 below.

Next Section:

Predicting Next Word Using Katz Back-Off: Part 3 - Understanding the Katz Back-Off Model

Appendix

########## Descriptive Statistics section ##########
library(readr)
fnames <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
datDir <- "../../data/en_US/originals/"
## Taken from http://rpubs.com/mszczepaniak/predictkbo1preproc
## Reads the text corpus data file and returns a character array where every
## element is a line from the file.
## fileId = string, text fragment of file name to be read e.g. 'blogs', 'news',
##          or 'twit'
## dataDir = path to data file to be read
## fnames = file names to be read which have fileId fragments
getFileLines <- function(fileId, dataDir=datDir, fileNames=fnames) {
    index <- grep(fileId, fileNames)
    if(length(index) == 0) {
        cat('getFileLines could not determine which file to read:', fileId)
        return(NULL)
    }
    fileLines <- read_lines(sprintf("%s%s", dataDir, fileNames[index]))
    return(fileLines)
}

## Returns the file size in Mb
getFileSize <- function(dataDir=datDir, filename) {
    inputFilePath <- paste0(dataDir, filename)
    return(file.size(inputFilePath) / (2^20))  # convert to MB
}

getLineCount <- function(dataDir=datDir, fileType) {
    return(length(getFileLines(fileType)))
}

## Somewhat crude estimate of word count, but is very close to other methods.
## Assumes that words are separated by spaces.
getWordCount <- function(fileType) {
    f <- getFileLines(fileType)
    return(length(unlist(strsplit(f, " "))))
}

## Returns the number of unique words (types) in the fileType file.
getTypeCount <- function(fileType) {
    f <- getFileLines(fileType)
    return(length(unique(unlist(strsplit(f, " ")))))
}
########## N-gram Computations and Frequency Charts ##########
library(dplyr)
library(ggplot2)
library(data.table)  # needed by getNgramTables (data.table() call below)
# https://cran.r-project.org/web/packages/cowplot/vignettes/plot_grid.html
library(cowplot)
library(quanteda)

getNgramType <- function(ngram=1) {
    ngType <- ifelse((ngram==1), "Unigram",
                      ifelse((ngram==2), "Bigram",
                              ifelse((ngram==3), "Trigram", "Unknown")))
    return(ngType)
}

## Returns a named vector of n-grams and their associated frequencies
## extracted from the character vector dat.
##
## ng - Defines the type of n-gram to be extracted: unigram if ng=1,
##      bigram if ng=2, trigram if n=3, etc.
## dat - Character vector from which we want to get n-gram counts.
## igfs - Character vector of words (features) to ignore from frequency table
## sort.by.ngram - sorts the return vector by the names
## sort.by.freq - sorts the return vector by frequency/count
getNgramFreqs <- function(ng, dat, igfs=NULL,
                          sort.by.ngram=TRUE, sort.by.freq=FALSE) {
    # http://stackoverflow.com/questions/36629329/
    # how-do-i-keep-intra-word-periods-in-unigrams-r-quanteda
    if(is.null(igfs)) {
        dat.dfm <- dfm(dat, ngrams=ng, toLower = FALSE, removePunct = FALSE,
                       what = "fasterword", verbose = FALSE)
    } else {
        dat.dfm <- dfm(dat, ngrams=ng, toLower = FALSE, ignoredFeatures=igfs,
                       removePunct = FALSE, what = "fasterword", verbose = FALSE)
    }
    rm(dat)
    # quanteda docfreq will get the document frequency of terms in the dfm
    ngram.freq <- docfreq(dat.dfm)
    if(sort.by.freq) { ngram.freq <- sort(ngram.freq, decreasing=TRUE) }
    if(sort.by.ngram) { ngram.freq <- ngram.freq[sort(names(ngram.freq))] }
    rm(dat.dfm)
    
    return(ngram.freq)
}

## Returns a 2 column data.table. The first column (ngram) contains the
## unigram (if n=1), the bigram (if n=2), etc.. The second column
## (freq) contains the frequency or count of the ngram found in linesCorpus.
##
## linesCorpus - character vector
## igfs - Character vector of words (features) to ignore from frequency table
## sort.by.ngram - If TRUE (default), returned table is sorted by ngram
## sort.by.freq - If TRUE, returned table is sorted by frequency, default=FALSE
## prefixFilter - string/character vector: If not NULL, tells the function
##                to return only rows where ngram column starts with prefixFilter.
##                If NULL, returns all the ngram and count rows.
getNgramTables <- function(n, linesCorpus, igfs=NULL, sort.by.ngram=TRUE,
                           sort.by.freq=FALSE, prefixFilter=NULL) {
    cat("start getNgramTables:", as.character(Sys.time()), "\n")
    ngrams <- getNgramFreqs(n, linesCorpus, igfs, sort.by.ngram, sort.by.freq)
    ngrams.dt <- data.table(ngram=names(ngrams), freq=ngrams)
    if(length(grep('^SOS', ngrams.dt$ngram)) > 0) {
        ngrams.dt <- ngrams.dt[-grep('^SOS', ngrams.dt$ngram),]
    }
    if(!is.null(prefixFilter)) {
        regex <- sprintf('%s%s', '^', prefixFilter)
        ngrams.dt <- ngrams.dt[grep(regex, ngrams.dt$ngram),]
    }
    cat("FINISH getNgramTables:", as.character(Sys.time()), "\n")
    return(ngrams.dt)
}

## Creates and writes out the raw n-gram frequency tables for each of the 
## corpus data files.  These are the initial n-gram tables that include the
## singletons.  Defaults to unigrams: n=1
## table.dir - string: dir where the files to process reside
## filePrefix - string: prefix of files to process
## inFilePostfix - string: ending/postfix portion of input file name
## outFilePostfix - string: ending/postfix portion of output file name
## n - integer: 1 if unigram table is to be created, 2 if bigram, 3 if trigram
makeRawNgrams <- function(table.dir=ddir, filePrefix="en_US.",
                          inFilePostfix=".train.8posteos.txt",
                          outFilePostfix=".train.9rawunig.csv",
                          fileTypes=c("blogs", "news", "twitter"), n=1) {
    inPaths <- sprintf("%s%s%s%s", table.dir, filePrefix, fileTypes,
                       inFilePostfix)
    outPaths <- sprintf("%s%s%s%s", table.dir, filePrefix, fileTypes,
                        outFilePostfix)
    for(i in 1:length(inPaths)) {
        charvect <- read_lines(inPaths[i])
        ngrams.raw <- getNgramTables(n, charvect)
        write.csv(ngrams.raw, outPaths[i], row.names = FALSE)
    }
}

## Returns a barplot of the topm (default = 10) ngrams (default = 1)
##
## data_path - path to the ngram vs. frequency data file
## data_type - string: "Blogs", "News", or "Twitter"
## ngram - integer: 1, 2, or 3
## topm - number of ngrams to display
getTopM_NgramsPlot <- function(data_path, data_type, ngram=1, topm=10,
                               txt_ang=45, vj=0.5, hj=0) {
    data.ngrams.raw <- read.csv(data_path)
    data.ngrams.raw <- arrange(data.ngrams.raw, desc(freq))
    ngrams.topm <- data.ngrams.raw[2:(topm+1),]  # EOS is most frequent: remove
    p <- ggplot(ngrams.topm, aes(x=reorder(ngram, freq, desc), y=freq))
    p <- p + geom_bar(stat = "identity")
    ng_type <- getNgramType(ngram)
    p <- p + xlab(ng_type) + ylab(sprintf("%s%s", ng_type, " Count"))
    # See eMPee584 comment under: http://stackoverflow.com/questions/11610377#comment-47793720
    p <- p + scale_y_continuous(labels=function(n){format(n, scientific = FALSE)})
    chart_title <- sprintf("%s%s%s%s%s%s%s", "Top ", topm, " ", data_type,
                           "\n", ng_type, " Counts")
    p <- p + ggtitle(chart_title)
    p <- p + theme(plot.title = element_text(size=10))
    p <- p + theme(axis.title.x = element_text(size=10))
    p <- p + theme(axis.text.x = element_text(size=10, angle=txt_ang, vjust=vj, hjust=hj))
    p <- p + theme(axis.title.y = element_text(size=10))
    
    return(p)
}
########### Unigram Singletons Code ###########

## Returns a scatter plot of the count of counts (how many unigrams occur
## with a given count) vs. the unigram count
##
## data_path - path to the ngram vs. frequency data file
## data_type - string: "Blogs", "News", or "Twitter"
## xcount - the number of unigram counts to display on the x-axis
getCountOfCountPlot <- function(data_path, data_type, xcount=100) {
    unigrams.raw <- read.csv(data_path)
    unigrams.raw <- arrange(unigrams.raw, desc(freq))
    unigram_freqs <- group_by(unigrams.raw, freq)
    unigram_cofc <- summarise(unigram_freqs, count_of_count = n())
    unigram_cofc <- filter(unigram_cofc, freq <= xcount)
    p <- ggplot(unigram_cofc, aes(x=freq, y=count_of_count))
    p <- p + geom_point()
    p <- p + xlab("Unigram Count") + ylab("Count of Unigram Count")
    p <- p + scale_y_log10(labels=function(n){format(n, scientific = FALSE)})
    chart_title <- sprintf("%s%s", data_type,
                           " Unigram Count\nof Count vs. Count")
    p <- p + ggtitle(chart_title)
    p <- p + theme(plot.title = element_text(size=10))
    p <- p + theme(axis.title.x = element_text(size=10))
    p <- p + theme(axis.title.y = element_text(size=10))
    
    return(p)
}

## Returns the percentage of singletons in corpus_type.
## corpus_type - string, valid values: "blogs", "news", "twitter"
## data_paths - character vector containing the urls to the blogs, news and twitter
##              raw unigram data files. Default values provided by
##              8/2016 M. Szczepaniak
getPercentSingletons <- function(corpus_type, data_paths=
   c("https://www.dropbox.com/s/pv16pe6buubsz4n/en_US.blogs.train.9rawunig.csv?dl=1",
     "https://www.dropbox.com/s/4u90r188ht2etic/en_US.news.train.9rawunig.csv?dl=1",
     "https://www.dropbox.com/s/2sskj3taoq9bkbi/en_US.twitter.train.9rawunig.csv?dl=1")) {
    if(corpus_type == "blogs") {
        data_path = data_paths[1]
    } else if(corpus_type == "news") {
        data_path = data_paths[2]
    } else if(corpus_type == "twitter") {
        data_path = data_paths[3]
    }
    ngrams <- read.csv(data_path)
    singletons <- length(ngrams[ngrams$freq==1,]$ngram)
    total_ngrams <- length(ngrams$ngram)
    
    return(100 * singletons / total_ngrams)
}
########## Code to build Figures 1 thru 12 ##########
# Figures 1-3 top 10 unigram frequencies for each corpus type: blogs, news, twitter
path <- "https://www.dropbox.com/s/pv16pe6buubsz4n/en_US.blogs.train.9rawunig.csv?dl=1"
p01 <- getTopM_NgramsPlot(path, "Blogs", 1, 10, txt_ang=90, vj=0.5, hj=0.9)
p01 <- p01 + coord_cartesian(ylim = c(0, 900000))
path <- "https://www.dropbox.com/s/4u90r188ht2etic/en_US.news.train.9rawunig.csv?dl=1"
p02 <- getTopM_NgramsPlot(path, "News", 1, 10, txt_ang=90, vj=0.5, hj=0.9)
p02 <- p02 + coord_cartesian(ylim = c(0, 900000))
path <- "https://www.dropbox.com/s/2sskj3taoq9bkbi/en_US.twitter.train.9rawunig.csv?dl=1"
p03 <- getTopM_NgramsPlot(path, "Twitter", 1, 10, txt_ang=90, vj=0.5, hj=0.9)
p03 <- p03 + coord_cartesian(ylim = c(0, 900000))

plot_grid(p01, p02, p03, labels = c("Fig 1:", "Fig 2:", "Fig 3:"),
          ncol = 3, align = 'h')

# Figures 4-6 unigram count of counts for each corpus type: blogs, news, twitter
path <- "https://www.dropbox.com/s/pv16pe6buubsz4n/en_US.blogs.train.9rawunig.csv?dl=1"
p04 <- getCountOfCountPlot(path, "Blogs")
path <- "https://www.dropbox.com/s/4u90r188ht2etic/en_US.news.train.9rawunig.csv?dl=1"
p05 <- getCountOfCountPlot(path, "News")
path <- "https://www.dropbox.com/s/2sskj3taoq9bkbi/en_US.twitter.train.9rawunig.csv?dl=1"
p06 <- getCountOfCountPlot(path, "Twitter")

plot_grid(p04, p05, p06, labels = c("Fig 4:", "Fig 5:", "Fig 6:"),
          ncol = 3, align = 'h')

# Figures 7-9 top 10 bigram frequencies for each corpus type: blogs, news, twitter
path <- "https://www.dropbox.com/s/6cgqa487xb0srbt/en_US.blogs.train.13bigrams.nosins.csv?dl=1"
p07 <- getTopM_NgramsPlot(path, "Blogs", 2, 10, txt_ang=90, vj=0.5, hj=0.9)
p07 <- p07 + coord_cartesian(ylim=c(0, 140000))
path <- "https://www.dropbox.com/s/5xobfsotplbqtv3/en_US.news.train.13bigrams.nosins.csv?dl=1"
p08 <- getTopM_NgramsPlot(path, "News", 2, 10, txt_ang=90, vj=0.5, hj=0.9)
p08 <- p08 + coord_cartesian(ylim=c(0, 140000))
path <- "https://www.dropbox.com/s/47hwbqffufmg16m/en_US.twitter.train.13bigrams.nosins.csv?dl=1"
p09 <- getTopM_NgramsPlot(path, "Twitter", 2, 10, txt_ang=90, vj=0.5, hj=0.9)
p09 <- p09 + coord_cartesian(ylim=c(0, 140000))

plot_grid(p07, p08, p09, labels = c("Fig 7:", "Fig 8:", "Fig 9:"),
          ncol = 3, align = 'h')

# Figures 10-12 top 10 trigram frequencies for each corpus type: blogs, news, twitter
path <- "https://www.dropbox.com/s/z0rz707mt3da1h1/en_US.blogs.train.14trigrams.nosins.csv?dl=1"
p10 <- getTopM_NgramsPlot(path, "Blogs", 3, 10, txt_ang=90, vj=0.5, hj=0.9)
p10 <- p10 + coord_cartesian(ylim=c(0, 12000))
path <- "https://www.dropbox.com/s/6e8eueyvnqa3jgs/en_US.news.train.14trigrams.nosins.csv?dl=1"
p11 <- getTopM_NgramsPlot(path, "News", 3, 10, txt_ang=90, vj=0.5, hj=0.9)
p11 <- p11 + coord_cartesian(ylim=c(0, 12000))
path <- "https://www.dropbox.com/s/6y0rvzd2bt45f1q/en_US.twitter.train.14trigrams.nosins.csv?dl=1"
p12 <- getTopM_NgramsPlot(path, "Twitter", 3, 10, txt_ang=90, vj=0.5, hj=0.9)
p12 <- p12 + coord_cartesian(ylim=c(0, 12000))

plot_grid(p10, p11, p12, labels = c("Fig 10:", "Fig 11:", "Fig 12:"),
          ncol = 3, align = 'h')