This report presents a high-level exploration of the text corpora with a view to meeting the following objectives.
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
The corpora were collected from publicly available sources by a web crawler. The crawler checks for language, so that the texts consist mainly of the desired language. The data set was downloaded from the course page here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
While several languages were available, this report uses the following English-language files, which were extracted from the zipped archive. Below is the relevant listing from the data directory.
-rw-r--r-- 1 TScantle 1049089 210160014 Aug 22 02:53 en_US.blogs.txt
-rw-r--r-- 1 TScantle 1049089 205811889 Aug 24 05:30 en_US.news.txt
-rw-r--r-- 1 TScantle 1049089 167105338 Aug 22 02:53 en_US.twitter.txt
Using a Git Bash shell and the “wc -l” command, I determined the number of newline-delimited lines in each data file. Here we see the file names and the number of lines.
Additionally, one interesting feature derived from the “wc -L” command was the size of the longest line, found in the blog file. Here we see the file names and the length of the longest line.
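For reference, these figures can also be reproduced in R. The sketch below is roughly equivalent to the wc commands above; it assumes the files sit in the data directory used later in this report, and reading the full files this way is slow.
# count the lines and find the longest line of one file in R
# (roughly equivalent to "wc -l" and "wc -L"; slow for files this large)
blog.lines <- readLines("data/en_US.blogs.txt", skipNul = TRUE)
length(blog.lines)       # number of lines
max(nchar(blog.lines))   # length of the longest line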
The files in the corpora are relatively large. Using the rbinom function, a representative sample of 1.0% of the text was copied into working files. These smaller files allowed for faster development of the data modelling procedures while still supporting representative inferences about the full dataset. The sample files were then combined as follows.
#setwd("~/Coursea/NLP")
source("RScripts/CapFunctions.R")
#generate the 1% sample files.
gen.samples()
#setwd("~/Coursea/NLP")
news.sample <- readLines("data/news.sample.txt")
blogs.sample <- readLines("data/blogs.sample.txt")
tweets.sample <- readLines("data/tweets.sample.txt")
# merge the three samples into one character vector
all.sample <- c(news.sample, blogs.sample, tweets.sample)
rm(news.sample, blogs.sample, tweets.sample)
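The sampling inside gen.samples (listed in full later in this report) flags lines to skip with a Bernoulli draw, keeping roughly 1% of them. A toy illustration of the idea, not run against the actual files:
# flag each line for skipping with probability 0.99; about 1% are kept
set.seed(1)
skip.idx <- rbinom(n = 100000, size = 1, prob = 0.99)
mean(skip.idx == 0)      # proportion of lines kept, close to 0.01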
The following data cleaning steps were performed: conversion of the text to lower case; removal of URLs, email addresses, and hash tags; replacement of Unicode punctuation with ASCII equivalents; removal of the remaining non-ASCII characters, numbers, and dates; collapsing of repeated punctuation and whitespace; splitting of the text into sentences marked with an end-of-sentence token (<s>); and removal of profanity.
A custom tokenizer was written to perform these steps.
library(dplyr)
# first clean the data and return sentences.
source("RScripts/CapFunctions.R")
all.sentences <- DataCleaner(all.sample)
library(NLP)
library(tm)
library(qdap)
library(data.table)
source("RScripts/CapFunctions.R")
# remove profanity, and a few other categories of words
# profane list used: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
tokens <- removeProfanity(all.sentences)
all.sentences <- unlist(strsplit(paste(tokens, collapse = " ", sep = "<s>")," <s> "))
all.words <- tokens[-grep("<s>",tokens)]
unique.words <- unique(all.words)
A number of tables and charts are presented below that explore the distributions of characters, words, and several n-grams in the sample corpora.
Below on the left is a table showing word counts and percentages, while on the right is a bar chart with the corresponding distribution of the top 50 words in the sample of the corpora. As expected, several stop words dominate the list. A stop word is a commonly used word (such as “the”) that is sometimes ignored in text processing tasks. However, stop words will not be removed from the final prediction algorithm.
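The word-count table is, in essence, a sorted frequency table of the word tokens. A minimal sketch of one way to derive such a table from all.words:
# frequency and percentage of each word in the sample, top 50
word.freq <- sort(table(all.words), decreasing = TRUE)
head(word.freq, 50)                                   # counts
round(head(word.freq, 50) / sum(word.freq) * 100, 3)  # percentages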
Below on the left is a table based on the number of characters in each word. The table shows counts and percentages, while on the right is a bar chart with the corresponding distribution by word length. This distribution looks somewhat Poisson-like.
Below on the left is a table of the number of characters per word, based on unique words only. The table shows counts and percentages, while on the right is a bar chart with the corresponding distribution. This distribution looks reasonably normal with a long right-hand tail.
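Both word-length distributions are simple transformations of the token vectors built earlier; a minimal sketch:
# distribution of word lengths, over all occurrences and over unique words
table(nchar(all.words))      # counts by word length, all occurrences
table(nchar(unique.words))   # counts by word length, unique words only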
Additionally, n-gram (n = 1 to 5) plots were developed and are displayed below using the same general format. An n-gram is a sequence of one or more consecutive words.
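For example, NLP::ngrams returns every run of n consecutive tokens from a vector of words; a toy illustration (not taken from the corpora):
# toy illustration: the bigrams of a four-word phrase
NLP::ngrams(c("thanks", "for", "the", "follow"), 2)
# yields the pairs ("thanks","for"), ("for","the"), ("the","follow")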
# generate ngrams
ngrams1 <- NLP::ngrams(all.words,1)
ngrams2 <- NLP::ngrams(all.words,2)
ngrams3 <- NLP::ngrams(all.words,3)
ngrams4 <- NLP::ngrams(all.words,4)
ngrams5 <- NLP::ngrams(all.words,5)
#combine the tokens into one string and calculate the frequency counts
df.ng1 <- data.frame(table(sapply(ngrams1, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng2 <- data.frame(table(sapply(ngrams2, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng3 <- data.frame(table(sapply(ngrams3, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng4 <- data.frame(table(sapply(ngrams4, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
df.ng5 <- data.frame(table(sapply(ngrams5, function(x) {paste(unlist(x), collapse = " ", sep = " " )})))
# split the strings into pairs - prior ngram and next token pairs
#This function creates pairs of ngrams, split at n - 1,
# adds them to the dataframe and returns a data table
add.ng.pairs <- function(df, n) {
if (!is.numeric(n) || n < 2) stop("n must be > 1")
xy <- unlist(sapply(df$Var1, function(x) {
y <- as.character(x);
a <- paste(strsplit(y,split=" ")[[1]][1:(n-1)],collapse=" ", sep =" ");
b <- strsplit(y,split=" ")[[1]][n];
c("prior"=a,"pred"=b)
}))
df$prior <- xy[1,]
df$pred <- xy[2,]
df <- as.data.table(df)
setkey(df,cols="prior")
df
}
df.ng1 <- as.data.table(df.ng1)
setkey(df.ng1,cols="Var1")
df.ng2 <- add.ng.pairs(df.ng2,2)
df.ng3 <- add.ng.pairs(df.ng3,3)
df.ng4 <- add.ng.pairs(df.ng4,4)
df.ng5 <- add.ng.pairs(df.ng5,5)
# Plans for creating a prediction algorithm and Shiny app
My plan is to build a basic n-gram model using the exploratory analysis performed above. I can split each n-gram in two - a prior n-gram and a predicted word. I intend to use a back-off model to handle n-grams that have not been seen.
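As a rough sketch of that idea, using the keyed data tables built above: the predict.next function below and its pick-the-most-frequent rule are illustrative assumptions only, not the final algorithm, and smoothing is ignored entirely.
# illustrative back-off lookup: try the longest matching prior first,
# then fall back to shorter n-grams; finally fall back to the top unigram
predict.next <- function(phrase) {
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  models <- list(df.ng5, df.ng4, df.ng3, df.ng2)
  sizes  <- c(4, 3, 2, 1)                  # prior length for each model
  for (k in seq_along(models)) {
    n <- sizes[k]
    if (length(words) < n) next
    prior.str <- paste(tail(words, n), collapse = " ")
    hits <- models[[k]][.(prior.str)]      # keyed join on the prior column
    if (nrow(hits) > 0 && !is.na(hits$pred[1]))
      return(hits$pred[which.max(hits$Freq)])
  }
  as.character(df.ng1$Var1[which.max(df.ng1$Freq)])
}
# example call (hypothetical input)
# predict.next("thanks for the")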
# a collection of functions used by the next word predictor program
# This function samples the 3 text files in the corpora and creates
# smaller working files.
gen.samples <- function() {
# This function converts the random 0/1 skip flags into the lengths of consecutive blocks of lines to skip
skip.chunks <- function(x){
brks = numeric()
j = 0
for(i in seq_along(x)){
if(x[i]) {
if(j == 0){
j <- 1
brks <- c(brks,0)
}
brks[length(brks)] <- brks[length(brks)] + 1
}
else j <- 0
}
brks
}
# This function samples a random 1.0% of the data file
sample.the.text <- function(seed=1,nlines=10,infile=con,outfile=con){
set.seed(seed)
# generate a random skip index (1 = skip the line, 0 = keep it)
skip.idx = rbinom(n=nlines, size=1, prob=0.99)
# calculate the lengths of the blocks of lines to skip
brks <- skip.chunks(skip.idx)
# sample the file: skip each block, then read the next line from the open connection
txt.sample <- sapply(brks, function(x) {
scan(file=infile, what = character(),
sep = "\n", nlines = 1, skip = x,
quiet = TRUE, skipNul = TRUE)
})
txt.sample = unlist(txt.sample)
# write the sample out to disk
writeLines(txt.sample,outfile)
}
#setwd("~/Coursea/NLP")
infile <- file("data/en_US.news.txt","rt")
outfile <- file("data/news.sample.txt","wt")
sample.the.text(seed=3456,nlines=1010242,infile=infile,outfile=outfile)
close(infile)
close(outfile)
infile <- file("data/en_US.blogs.txt","rt")
outfile <- file("data/blogs.sample.txt","wt")
sample.the.text(seed=3456,nlines=899288,infile=infile,outfile=outfile)
close(infile)
close(outfile)
infile <- file("data/en_US.twitter.txt","rt")
outfile <- file("data/tweets.sample.txt","wt")
sample.the.text(seed=3456,nlines=2360148,infile=infile,outfile=outfile)
close(infile)
close(outfile)
}
# This function replaces unicode punctuation with ascii
#https://digwp.com/2011/07/clean-up-weird-characters-in-database/
#encoding = 'UTF-8'
replace.unicodes <- function(x){
gsub("\u201c", '"',    # left double quote
gsub("\u201d", '"',    # right double quote
gsub("\u2018", " ",    # left single quote, use space
gsub("\u2019", "'",    # right single quote, use apostrophe
gsub("\u2013", "-",    # en dash, use hyphen
gsub("\u2014", "-",    # em dash, use hyphen
gsub("\u2022", "-",    # bullet, use hyphen
gsub("\u2026", " ",    # ellipsis, use space
x))))))))
}
#This function performs various tasks to clean the data
DataCleaner <- function(all.sample) {
# split on line break, and make lower case
all.lines <- unlist(tolower(strsplit(all.sample, "\b", perl=TRUE)))
# remove urls and possibly email addresses
all.lines <- gsub(pattern="\\w+:\\/{2}[\\d\\w-]+(\\.[\\d\\w-]+)*(?:(?:\\/[^\\s/]*))*|www[12]*\\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+|([0-9a-z-_@.&+]{2,}\\.[0-9a-z-_@.&+]{2,3}\\.[0-9a-z-_@.&+]{2,3}|[0-9a-z-_@.&+]{2,}\\.[0-9a-z-_@.&+]{2,3})", replacement = " " ,all.lines, perl = T)
# remove hash tags
all.lines <- gsub(pattern="#.+(?:=[[\u202F\u00A0]|[\\s\\b\\.\\?\\!])+", replacement = " " ,all.lines, perl = T)
# replace unicode punctuation (which is otherwise seen as [[:alpha:]])
all.lines <- unlist(replace.unicodes(all.lines))
# replace all non-english-ascii-like characters with space
all.lines <- iconv(all.lines, "latin1", "ASCII", sub=" ")
# collapse multiple dots
all.lines <- gsub(pattern="\\.{2,}", replacement = " " ,all.lines, perl = T)
# replace numbers, dates
all.lines <- gsub("([\\d:,$-\\/])*?(\\.)?[\\d]+( )?(am|pm|a.m|p.m|st|nd|th)?"," ",all.lines,perl = T)
# keep only words, apostrophes, and end-of-sentence markers ## hyphen
all.lines <- gsub(pattern="[^\\'\\.\\?\\![:alpha:]]", replacement = " " ,all.lines, perl = T)
# remove trailing apostrophes
all.lines <- gsub(pattern="[\\']+(?=\\.|\\s|\\ |$)|\\ \\'", replacement = " " ,all.lines, perl = TRUE)
# remove words with 1:3 consecutive same letter triplets
all.lines <- gsub(pattern="(?:[\\sa-z]*)(([a-z]{1,3}))\\1{2,3}[a-z]*(?:[\\s\\.\\?\\!\\'\\b]{0})", replacement = " " ,all.lines, perl = T)
#expand and add space to embedded sentence markers, but not period
all.lines <- gsub(pattern="([\\!\\?])+(?=[[:alpha:]])", replacement = "\\1 " ,all.lines, perl = TRUE)
# use one consistent end-of-sentence marker
all.lines <- gsub(pattern="[\\?\\!]+", replacement = "\\." ,all.lines, perl = T)
# collapse a space before a dot
all.lines <- gsub(pattern=" \\.", replacement = "\\." ,all.lines, perl = T)
# collapse multiple dots
all.lines <- gsub(pattern="\\.{2,}", replacement = " " ,all.lines, perl = T)
# collapse multiple whitespace, quotes, or commas to one space
all.lines <- gsub(pattern='[\\"\\,\\(\\)[:space:]]+', replacement = " " ,all.lines, perl = T)
# remove leading or trailing spaces
all.lines <- gsub("^\\s|\\s+$","",all.lines, perl = TRUE)
# create sentences
all.sentences <- unlist(strsplit(all.lines, "(\\.|\\?|\\!)([\\s|\\b]+|$)", perl=TRUE))
all.sentences <- gsub("$"," <s>",all.sentences, perl = TRUE)
# remove empty lines
all.sentences <- all.sentences[-grep("^ <s>$",all.sentences, perl = T)]
# remove leading dot
all.sentences <- gsub("[[:space:]]\\."," ",all.sentences, perl = TRUE)
all.sentences <- gsub("^\\.","",all.sentences, perl = TRUE)
all.sentences
}
removeProfanity <- function(all.sentences) {
cwords <- readLines("data/profanewords.txt",warn = FALSE)
all.words <- unlist(strsplit(all.sentences, "[[:space:]]", perl=TRUE))
unique.words <- unique(all.words)
j = integer()
for(i in seq_along(unique.words)){
pattern <- paste0("^",unique.words[i],"$")
if( is.null(unique.words[i]) || is.na(unique.words[i]) || nchar(unique.words[i])<1)
j <- c(j,grep(pattern,all.words, perl=T))
else
if (! unique.words[i]=="<s>")
if(nchar(unique.words[i])== 1 && ! unique.words[i] %in% c("a","i"))
#j <- c(j,grep(paste0("^[",unique.words[i],"]$"),all.words, perl=T))
j <- c(j,grep(pattern,all.words, perl=T))
else
if(length(grep(pattern,cwords))) j <- c(j,grep(pattern,all.words, perl=T))
}
all.words[-j]
}
# the following code produces a typical n-gram plot (trigram example shown)
library(ggplot2)
library(grid)
library(gridExtra)
sumcount <- sum(df.ng3$Freq)
df.ng3$Percent <- round(df.ng3$Freq / sumcount * 100 , 3)
df1 <- df.ng3[ order(-df.ng3[,2], df.ng3[,1]), ][1:50,c(1,2,5)]
p <- ggplot(df1, aes(x=reorder(Var1, Freq), y = Freq)) +
geom_bar(stat='identity', fill="brown") + coord_flip() + labs(x = "")
pg <- ggplotGrob(p)
mytheme <- gridExtra::ttheme_default(
core = list(fg_params=list(cex = .6, hjust=1, x=0.9)),
colhead = list(fg_params=list(cex = .6, hjust=1, x=0.9)),
rowhead = list(fg_params=list(cex = 1.0)))
tf <-tableGrob(df1,rows = NULL, theme = mytheme)
tf$heights <- unit(rep(1/nrow(tf), nrow(tf)), "npc")
t1 <- grobTree(rectGrob(gp=gpar(alpha=0.0)), textGrob(" "))
lay <- rbind(c(1,2),c(3,2))
grid.arrange(tf, pg, t1, ncol=2,
layout_matrix = cbind(c(1,3),c(2,2)),
heights=unit(c(6.5,6.5), c("in", "mm")),
widths = c(1,5),
top = "Trigram Counts")