Overview

This report presents a high-level exploration of the text corpora that will be used to build a next-word prediction model.

Source data

The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so that the collected texts consist mainly of the desired language. The data set was downloaded from the course page: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

While several languages were available, this report uses the following English-language files, which were extracted from the zipped archive. Below is the relevant listing from the data directory.

-rw-r--r-- 1 TScantle 1049089 210160014 Aug 22 02:53 en_US.blogs.txt
-rw-r--r-- 1 TScantle 1049089 205811889 Aug 24 05:30 en_US.news.txt
-rw-r--r-- 1 TScantle 1049089 167105338 Aug 22 02:53 en_US.twitter.txt

Data preparation

Using a Git Bash shell and the “wc -l” command, I determined the number of delimited lines in each data file. The file names and line counts are shown below.

Additionally, the “wc -L” command reports the length of the longest line, which was found in the blogs file. The file names and the longest line lengths are shown below.
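For completeness, roughly the same counts can be reproduced directly in R. The sketch below reads each file fully into memory (slow for files of this size), and the byte-based line lengths may differ slightly from the wc output.

# rough R equivalent of "wc -l" and "wc -L"
files <- c("data/en_US.blogs.txt", "data/en_US.news.txt", "data/en_US.twitter.txt")
t(sapply(files, function(f) {
    lines <- readLines(f, skipNul = TRUE, warn = FALSE)
    c(line.count = length(lines), longest.line = max(nchar(lines, type = "bytes")))
}))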

Sampling the data

The files in the corpora are relatively large. Using the rbinom function, a representative sample of roughly 1.0% of the text was copied into working files. These smaller files allowed for faster development of the data modelling procedures while still supporting reliable inferences about the full dataset. The sample files were then combined as follows.

#setwd("~/Coursea/NLP")
source("RScripts/CapFunctions.R")
#generate the 1% sample files.
gen.samples()
#setwd("~/Coursea/NLP")
news.sample <- readLines("data/news.sample.txt")
blogs.sample <- readLines("data/blogs.sample.txt")
tweets.sample <- readLines("data/tweets.sample.txt")
# merge the samples into one character vector
all.sample <- c(news.sample, blogs.sample, tweets.sample)
rm(news.sample,blogs.sample,tweets.sample)
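For reference, a simpler (but more memory-hungry) way to draw the same kind of 1% Bernoulli sample is to read a whole file and keep each line with probability 0.01. This is only a sketch and is not used by the pipeline; the output file name news.sample.alt.txt is illustrative.

# sketch only: keep each line independently with probability 0.01
set.seed(3456)
news <- readLines("data/en_US.news.txt", skipNul = TRUE, warn = FALSE)
keep <- rbinom(length(news), size = 1, prob = 0.01) == 1
writeLines(news[keep], "data/news.sample.alt.txt")
rm(news, keep)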

Cleaning the data

The following data cleaning steps were performed:

  • Removing URLs and email addresses
  • Removing hash tags
  • Removing punctuation and non-ASCII characters
  • Removing profanity and other unwanted words
  • Collapsing white space
  • Splitting the data into lines, sentences and tokens

A custom tokenizer was written to perform these steps.

library(dplyr)
# first clean the data and return sentences.
source("RScripts/CapFunctions.R")
all.sentences <- DataCleaner(all.sample) 
library(NLP)
library(tm)
library(qdap)
library(data.table)

source("RScripts/CapFunctions.R")

# remove profanity, and a few other categories of words
# profane list used: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
tokens <- removeProfanity(all.sentences)

# rebuild sentences from the cleaned tokens, then drop the <s> markers to get the word list
all.sentences <- unlist(strsplit(paste(tokens, collapse = " "), " <s> "))
all.words <- tokens[tokens != "<s>"]
unique.words <- unique(all.words)
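To make the cleaning conventions concrete, the snippet below runs a short made-up string through DataCleaner. The example text is purely illustrative; the exact output depends on the regular expressions listed in the appendix.

# toy illustration only: clean a short string
example.text <- "Check out http://example.com!! It's GREAT..."
DataCleaner(example.text)
# the URL is stripped, the text is lower-cased and each sentence ends with the <s> marker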

Exploratory Analysis

A number of tables and charts have been presented that explore the distributions of characters, words and several n-grams in the sample corpora.

Below on the left is a table showing word counts and percentages, while on the right is a bar chart with the corresponding distribution of the top 50 words in the corpora sample. As expected, several stop words dominate the list. A stop word is a commonly used word (such as “the”) that is sometimes ignored in text processing tasks. However, stop words will not be removed for the final prediction algorithm.

Below on the left is a table based on the number of characters in each word, showing counts and percentages, while on the right is a bar chart with the corresponding distribution by word length. This distribution looks roughly Poisson.

Below on the left is a table of the number of characters per word, based on unique words only, showing counts and percentages, while on the right is a bar chart with the corresponding distribution. This distribution looks reasonably normal with a long right tail.
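The tabulations behind these tables can be reproduced directly from all.words and unique.words; the snippet below is included only to make the calculations explicit.

# word counts behind the top-50 table
word.freq <- sort(table(all.words), decreasing = TRUE)
head(word.freq, 10)
# distribution of word lengths over all tokens (counts as percentages)
round(100 * prop.table(table(nchar(all.words))), 2)
# distribution of word lengths over unique words only
round(100 * prop.table(table(nchar(unique.words))), 2)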

Additionally, n-gram (n = 1 to 5) plots were developed and are displayed below using the same general format. An n-gram is a sequence of n consecutive words.

# generate ngrams
ngrams1 <- NLP::ngrams(all.words,1)
ngrams2 <- NLP::ngrams(all.words,2)
ngrams3 <- NLP::ngrams(all.words,3)
ngrams4 <- NLP::ngrams(all.words,4)
ngrams5 <- NLP::ngrams(all.words,5)


#combine the tokens of each n-gram into one string and calculate the frequency counts
df.ng1 <- data.frame(table(sapply(ngrams1, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng2 <- data.frame(table(sapply(ngrams2, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng3 <- data.frame(table(sapply(ngrams3, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng4 <- data.frame(table(sapply(ngrams4, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng5 <- data.frame(table(sapply(ngrams5, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
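As a quick check, the most frequent unigrams can be read straight off the unigram frequency table and should match the top of the word-count table shown earlier.

# top 10 unigrams by frequency
head(df.ng1[order(-df.ng1$Freq), ], 10)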

# split each n-gram string into a pair: the prior (n-1)-gram and the next token
# This function splits each n-gram at position n - 1, adds the resulting
# prior/pred columns to the data frame and returns a keyed data table
add.ng.pairs <- function(df, n) {
    if(!is.numeric(n) || n < 2) stop("n must be > 1")
    xy <- unlist(sapply(df$Var1, function(x) {
        y <- as.character(x); 
        a <- paste(strsplit(y,split=" ")[[1]][1:(n-1)],collapse=" ", sep =" ");
        b <- strsplit(y,split=" ")[[1]][n];
        c("prior"=a,"pred"=b)
    }))
    df$prior <- xy[1,]
    df$pred  <- xy[2,]
    df <- as.data.table(df)
    setkeyv(df, "prior")
    df
}
df.ng1 <- as.data.table(df.ng1)
setkeyv(df.ng1, "Var1")
df.ng2 <- add.ng.pairs(df.ng2,2)
df.ng3 <- add.ng.pairs(df.ng3,3)
df.ng4 <- add.ng.pairs(df.ng4,4)
df.ng5 <- add.ng.pairs(df.ng5,5)
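To make the prior/pred split concrete, the keyed tables can be queried for any prior. The phrase “one of” below is only an example and may or may not occur in a particular 1% sample.

# example lookup: most frequent words following the prior "one of"
head(df.ng3[.("one of"), nomatch = 0L][order(-Freq), .(prior, pred, Freq)], 5)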

Plans for creating a prediction algorithm and Shiny app

My plan is to build a basic n-gram model using the exploratory analysis performed above. Each n-gram can be split in two: a prior n-gram and a predicted word. I intend to use a back-off model to handle n-grams that have not been seen.
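The sketch below illustrates the intended back-off lookup against the keyed n-gram tables built above (data.table already loaded). The function name predict.next and the unigram fall-back are placeholders, not the final algorithm.

# illustrative back-off sketch: try the longest prior first, then back off
predict.next <- function(phrase, tables = list(df.ng5, df.ng4, df.ng3, df.ng2)) {
    words <- unlist(strsplit(tolower(phrase), "[[:space:]]+"))
    for (dt in tables) {
        # number of words in this table's prior (4 for 5-grams, ..., 1 for bigrams)
        n.prior <- length(strsplit(as.character(dt$prior[1]), " ")[[1]])
        if (length(words) < n.prior) next
        key.str <- paste(tail(words, n.prior), collapse = " ")
        hits <- dt[.(key.str), nomatch = 0L]
        if (nrow(hits) > 0) return(as.character(hits[order(-Freq)]$pred[1]))
    }
    "the"   # placeholder fall-back, e.g. the most frequent unigram
}
# e.g. predict.next("at the end of")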

Appendix

# a collection of functions used by the next word predictor program

# This function samples the 3 text files in the corpora and creates 
# smaller working files.
gen.samples <- function() {
    
    # This function converts the random skip indicators into run lengths of consecutive lines to skip
    skip.chunks <- function(x){
        brks = numeric()
        j = 0
        for(i in seq_along(x)){
            if(x[i]) {
                if(j == 0){
                    j <- 1
                    brks <- c(brks,0)
                }
                brks[length(brks)] <- brks[length(brks)] + 1
            } 
            else j <- 0
        } 
        brks
    }
    
    # This function samples a random 1.0% of the data file 
    sample.the.text <- function(seed=1,nlines=10,infile=con,outfile=con){
        set.seed(seed)
        # generate a random skip indicator (1 = skip the line, 0 = keep it)
        skip.idx = rbinom(n=nlines, size=1, prob=0.99)
        # calculate the break points (lengths of the runs of lines to skip)
        brks <- skip.chunks(skip.idx)
        # sample the file
        txt.sample <- sapply(brks, function(x) {
            scan(file=infile, what = character(), 
                 sep = "\n", nlines = 1, skip = x, 
                 quiet = TRUE, skipNul = TRUE)
        })
        txt.sample = unlist(txt.sample)
        # write the sample out to disk
        writeLines(txt.sample,outfile)
    }
    
    #setwd("~/Coursea/NLP")
    infile <- file("data/en_US.news.txt","rt")
    outfile <- file("data/news.sample.txt","wt")
    sample.the.text(seed=3456,nlines=1010242,infile=infile,outfile=outfile)
    close(infile)
    close(outfile)
    
    infile <- file("data/en_US.blogs.txt","rt")
    outfile <- file("data/blogs.sample.txt","wt")
    sample.the.text(seed=3456,nlines=899288,infile=infile,outfile=outfile)
    close(infile)
    close(outfile)
    
    infile <- file("data/en_US.twitter.txt","rt")
    outfile <- file("data/tweets.sample.txt","wt")
    sample.the.text(seed=3456,nlines=2360148,infile=infile,outfile=outfile)
    close(infile)
    close(outfile)
    
} 

# This function replaces unicode punctuation with ascii 
#https://digwp.com/2011/07/clean-up-weird-characters-in-database/
#encoding = 'UTF-8'
replace.unicodes <- function(x){
    # The patterns are the raw UTF-8 byte sequences of the punctuation marks,
    # matched with useBytes = TRUE so they are caught whatever encoding mark
    # the strings carry.
    x <- gsub("\xe2\x80\x9c", "\"", x, fixed = TRUE, useBytes = TRUE) # left double quote
    x <- gsub("\xe2\x80\x9d", "\"", x, fixed = TRUE, useBytes = TRUE) # right double quote
    x <- gsub("\xe2\x80\x98", " ",  x, fixed = TRUE, useBytes = TRUE) # left single quote, use space
    x <- gsub("\xe2\x80\x99", "'",  x, fixed = TRUE, useBytes = TRUE) # right single quote, use apostrophe
    x <- gsub("\xe2\x80\x93", "-",  x, fixed = TRUE, useBytes = TRUE) # en dash
    x <- gsub("\xe2\x80\x94", "-",  x, fixed = TRUE, useBytes = TRUE) # em dash
    x <- gsub("\xe2\x80\xa2", "-",  x, fixed = TRUE, useBytes = TRUE) # bullet, use hyphen
    x <- gsub("\xe2\x80\xa6", " ",  x, fixed = TRUE, useBytes = TRUE) # ellipsis, use space
    x
}

#This function performs various tasks to clean the data
DataCleaner <- function(all.sample) {
    
    # split on line breaks and convert to lower case
    all.lines <- tolower(unlist(strsplit(all.sample, "\n", perl = TRUE)))
    
    # remove urls and possibly email addresses
    all.lines <- gsub(pattern="\\w+:\\/{2}[\\d\\w-]+(\\.[\\d\\w-]+)*(?:(?:\\/[^\\s/]*))*|www[12]*\\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|([0-9a-z-_@.&+]{2,}\\.[0-9a-z-_@.&+]{2,3}\\.[0-9a-z-_@.&+]{2,3}|[0-9a-z-_@.&+]{2,}\\.[0-9a-z-_@.&+]{2,3})", replacement = " "  ,all.lines, perl = T)
    
    # remove hash tags
    all.lines <- gsub(pattern="#.+(?:=[[\u202F\u00A0]|[\\s\\b\\.\\?\\!])+", replacement = " "  ,all.lines, perl = T)
    
    # replace unicode punctuation (otherwise it is seen as [[:alpha:]] characters)
    all.lines <- unlist(replace.unicodes(all.lines))
    
    # replace all non-english-ascii-like characters with space
    all.lines <- iconv(all.lines, "latin1", "ASCII", sub=" ")
    
    # collapse multiple dots
    all.lines <- gsub(pattern="\\.{2,}", replacement = " "  ,all.lines, perl = T)
    
    # replace numbers, dates
    all.lines <- gsub("([\\d:,$-\\/])*?(\\.)?[\\d]+( )?(am|pm|a.m|p.m|st|nd|th)?"," ",all.lines,perl = T)
    
    # keep only words, apostrophes and end-of-sentence markers (hyphens are dropped)
    all.lines <- gsub(pattern="[^\\'\\.\\?\\![:alpha:]]", replacement = " "  ,all.lines, perl = T)
    
    # remove trailing apostrophes
    all.lines <- gsub(pattern="[\\']+(?=\\.|\\s|\\ |$)|\\ \\'", replacement = " "  ,all.lines, perl = TRUE)
    
    # remove words in which a short (1-3 letter) group repeats several times, e.g. 'hahaha'
    all.lines <- gsub(pattern="(?:[\\sa-z]*)(([a-z]{1,3}))\\1{2,3}[a-z]*(?:[\\s\\.\\?\\!\\'\\b]{0})", replacement = " "  ,all.lines, perl = T)
    
    # add a space after embedded sentence markers (! or ?), but not after periods
    all.lines <- gsub(pattern="([\\!\\?])+(?=[[:alpha:]])", replacement = "\\1 "  ,all.lines, perl = TRUE)
    
    # use one consistent end-of-sentence marker
    all.lines <- gsub(pattern="[\\?\\!]+", replacement = "\\."  ,all.lines, perl = T)
    
    # remove any space before a dot
    all.lines <- gsub(pattern=" \\.", replacement = "\\."  ,all.lines, perl = T)
    
    # collapse multiple dots
    all.lines <- gsub(pattern="\\.{2,}", replacement = " "  ,all.lines, perl = T)
    
    
    # collapse runs of whitespace, quotes, commas or parentheses to one space
    all.lines <- gsub(pattern='[\\"\\,\\(\\)[:space:]]+', replacement = " "  ,all.lines, perl = T)
    
    
    # remove leading or trailing spaces
    all.lines <- gsub("^\\s|\\s+$","",all.lines, perl = TRUE)
    

    # create sentences
    all.sentences <- unlist(strsplit(all.lines, "(\\.|\\?|\\!)([\\s|\\b]+|$)", perl=TRUE))
    
    all.sentences <- gsub("$"," <s>",all.sentences, perl = TRUE)

    # remove empty lines (grepl avoids dropping everything when there are no matches)
    all.sentences <- all.sentences[!grepl("^ <s>$", all.sentences, perl = TRUE)]

    # remove leading dot
    all.sentences <- gsub("[[:space:]]\\."," ",all.sentences, perl = TRUE)
    all.sentences <- gsub("^\\.","",all.sentences, perl = TRUE)
    
    all.sentences
    
    
}

removeProfanity <- function(all.sentences) {
    cwords <- readLines("data/profanewords.txt",warn = FALSE)
    all.words <- unlist(strsplit(all.sentences, "[[:space:]]", perl=TRUE))
    unique.words <- unique(all.words)
    j = integer()
    for(i in seq_along(unique.words)){
        pattern <- paste0("^",unique.words[i],"$")
        if( is.null(unique.words[i]) || is.na(unique.words[i]) || nchar(unique.words[i])<1)
            j <- c(j,grep(pattern,all.words, perl=T))
        else
            if (! unique.words[i]=="<s>") 
                if(nchar(unique.words[i])== 1 && ! unique.words[i] %in% c("a","i")) 
                    #j <- c(j,grep(paste0("^[",unique.words[i],"]$"),all.words, perl=T))
                    j <- c(j,grep(pattern,all.words, perl=T))
                else 
                    if(length(grep(pattern,cwords))) j <- c(j,grep(pattern,all.words, perl=T))
    }
    # guard against an empty index: all.words[-integer(0)] would drop every word
    if (length(j)) all.words[-j] else all.words
}
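A vectorised variant of the same filtering rules is sketched below. It has not been swapped into the pipeline, and the name removeProfanity.fast is only illustrative.

# vectorised sketch of the filtering rules used in removeProfanity()
removeProfanity.fast <- function(all.sentences, profane.file = "data/profanewords.txt") {
    cwords <- readLines(profane.file, warn = FALSE)
    all.words <- unlist(strsplit(all.sentences, "[[:space:]]+", perl = TRUE))
    # drop empty tokens up front
    all.words <- all.words[nzchar(all.words)]
    keep <- all.words == "<s>" |                            # keep sentence markers
        all.words %in% c("a", "i") |                        # keep the two one-letter words
        (nchar(all.words) > 1 & !(all.words %in% cwords))   # keep multi-letter, non-profane words
    all.words[keep]
}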

# the following code is a typical ngram plot
library(ggplot2)
library(grid)
library(gridExtra)

sumcount <- sum(df.ng3$Freq)
df.ng3$Percent <- round(df.ng3$Freq / sumcount * 100 , 3)

df1 <- df.ng3[order(-Freq, Var1)][1:50, .(Var1, Freq, Percent)]

p <- ggplot(df1, aes(x=reorder(Var1, Freq), y = Freq)) + 
    geom_bar(stat='identity', fill="brown") + coord_flip() + labs(x = "")
pg <- ggplotGrob(p)

mytheme <- gridExtra::ttheme_default(
    core = list(fg_params=list(cex = .6, hjust=1, x=0.9)),
    colhead = list(fg_params=list(cex = .6, hjust=1, x=0.9)),
    rowhead = list(fg_params=list(cex = 1.0)))

tf <-tableGrob(df1,rows = NULL, theme = mytheme)
tf$heights <- unit(rep(1/nrow(tf), nrow(tf)), "npc")

t1 <- grobTree(rectGrob(gp=gpar(alpha=0.0)), textGrob(" "))


lay <- rbind(c(1,2),c(3,2))

grid.arrange(tf, pg, t1, ncol=2, 
             layout_matrix = cbind(c(1,3),c(2,2)),
             heights=unit(c(6.5,6.5), c("in", "mm")), 
             widths = c(1,5),
             top = "Trigram Counts")