Executive Summary

This report documents the process and decisions involved in developing a predictive text model for the Data Science Capstone project. All code used so far is also shared in this report. The discipline used for this task is Natural Language Processing, also known as NLP, and the chosen modeling technique is the N-Gram, a contiguous sequence of n items from a given sequence of text or speech. An n-gram model looks (n - 1) words into the past and therefore has the memoryless property of a Markov model.

This project works with 1-gram (n = 1), 2-gram (n = 2), and 3-gram (n = 3) models. The basic building blocks of the models are unigrams, bigrams, and trigrams. After trying several R libraries, I settled on the following packages: 'tm' for text mining; 'filehash' for database-backed corpus storage; 'tau' to build the n-grams; and 'scales' and 'wordcloud' for visualization.
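
To make these building blocks concrete, here is a toy illustration (not part of the project code, which uses 'tau' later on) of the unigrams, bigrams, and trigrams of a single sentence, using base R only:

# Toy example: n-grams of one sentence with base R
words <- strsplit("the cat sat on the mat", " ")[[1]]

unigrams <- words
bigrams  <- apply(embed(words, 2)[, 2:1], 1, paste, collapse = " ")
trigrams <- apply(embed(words, 3)[, 3:1], 1, paste, collapse = " ")

bigrams
## [1] "the cat" "cat sat" "sat on"  "on the"  "the mat"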

I also present my plans and goals for an eventual app and algorithm at the end of this report.

Data Processing

This section briefly addresses the acquisition, processing, and exploration of the data. The main goal here is to understand the data and determine what should be done with it. The dataset was provided by SwiftKey, a software developer, in association with Johns Hopkins University, and comprises corpora from blogs, Twitter, and news outlets.

The dataset can be downloaded here and is composed of a zip file that includes blog posts, news articles, and Twitter tweets in four languages (English, German, Finnish, and Russian). I decided to work only with the English files.

The libraries used:

library('tm')   
## Warning: package 'tm' was built under R version 3.3.2
## Loading required package: NLP
library('filehash')
## filehash: Simple Key-Value Database (2.3 2015-08-12)
library('tau')
library('wordcloud')
## Loading required package: RColorBrewer
library('scales')
library('stringi')
library('stringr')

I decided to use a virtual corpus (VCorpus) to read the three data sets, so I created three folders and associated one corpus with each folder:

Twitter <- VCorpus(DirSource("twitter", encoding = "UTF-8"), readerControl = list(language="en"))

Blogs <- VCorpus(DirSource("blogs", encoding = "UTF-8"), readerControl = list(language="en"))

News <- VCorpus(DirSource("news", encoding = "UTF-8"), readerControl = list(language="en"))

Basic Exploratory Analysis

Due to memory problems and the processing time required to knit this report, I decided to show the exploratory analysis on a 10% sample of the corpora. For those who would like to see the full code, I provide a Markdown file that can be accessed here.

Sampling the data sets:

set.seed(148)
# randomize the order of the lines, then compute the size of a 10% sample
Twitter.sample <- sample(Twitter[[1]][[1]], length(Twitter[[1]][[1]]))
Twitsample <- round(0.1*length(Twitter.sample))

News.sample <- sample(News[[1]][[1]], length(News[[1]][[1]]))
Newssample <- round(0.1*length(News.sample))

Blogs.sample <- sample(Blogs[[1]][[1]], length(Blogs[[1]][[1]]))
Blogsample <- round(0.1*length(Blogs.sample))

Basic readings:

summary(Twitsample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  236000  236000  236000  236000  236000  236000
summary(nchar(Twitsample))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6
range(Twitsample)
## [1] 236015 236015
summary(Newssample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7726    7726    7726    7726    7726    7726
summary(nchar(Newssample))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4       4       4       4       4       4
range(Newssample)
## [1] 7726 7726
summary(Blogsample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   89930   89930   89930   89930   89930   89930
summary(nchar(Blogsample))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       5       5       5       5       5
range(Blogsample)
## [1] 89929 89929

Inspecting the size of the full Datasets:

format(object.size(News), units = "MB")
## [1] "19.2 Mb"
format(object.size(Twitter), units = "MB")
## [1] "301.4 Mb"
format(object.size(Blogs), units = "MB")
## [1] "248.5 Mb"

To split the corpora into training, devtest, and test sets, I randomized the order of each corpus, assigned 60% of the lines to the training set, and split the remainder evenly between the devtest and test sets. Due to problems during the knitting process, only the Twitter corpus is shown in this report; the Blogs and News corpora were handled the same way (a sketch is shown after the Twitter code below).

Twitter

set.seed(148)
perm.twitter <- sample(Twitter[[1]][[1]], length(Twitter[[1]][[1]]))
TwitR <- round(0.6*length(perm.twitter))
twitterTrain <- perm.twitter[1:TwitR]
remain <- perm.twitter[-(1:TwitR)]

DEV <- round(0.5*(length(remain)))
twitterDevTest <- remain[1:DEV]
twitterTest <- remain[-(1:DEV)]

write(twitterTrain, "twitterTrain.txt")
write(twitterDevTest, "twitterDevTest.txt")
write(twitterTest, "twitterTest.txt")
rm(list = ls())
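
For reference, the Blogs and News corpora were split with the same 60/20/20 procedure. Since their code was left out of the knitted report, the sketch below shows how the Blogs split would look; the file names (blogsTrain.txt, etc.) are my placeholders, and the corpus is re-read because rm(list = ls()) above clears the workspace:

# Sketch (not run here): same 60/20/20 split applied to the Blogs corpus
Blogs <- VCorpus(DirSource("blogs", encoding = "UTF-8"),
                 readerControl = list(language = "en"))

set.seed(148)
perm.blogs <- sample(Blogs[[1]][[1]], length(Blogs[[1]][[1]]))
BlogR <- round(0.6 * length(perm.blogs))
blogsTrain <- perm.blogs[1:BlogR]
remain <- perm.blogs[-(1:BlogR)]

DEV <- round(0.5 * length(remain))
blogsDevTest <- remain[1:DEV]
blogsTest <- remain[-(1:DEV)]

write(blogsTrain, "blogsTrain.txt")
write(blogsDevTest, "blogsDevTest.txt")
write(blogsTest, "blogsTest.txt")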

Data Cleaning

As part of preprocessing, I built a list of profanity and rude words to filter from the dataset. To do so, I used the Front Gate Media list, which contains more than 700 such words. These words will be used as stopwords later on (a sketch of that step follows the code below).

profanity <- read.csv("Terms-to-Block.csv")
profanity <- profanity[-c(1:3),]          # drop the first three rows of the file
profanity <- rep(profanity$Your.Gateway.to.the.Chrisitan.Audience)  # rep() with one argument simply returns the word column
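
The removal step itself is not shown in this report; the following is a minimal sketch of how I expect to apply the list later, using tm's removeWords transformation on the persistent corpus built in the next section:

# Sketch (my assumption of the later step, not run here): apply the profanity
# list as stopwords once the persistent Corpus below exists.
profanity <- as.character(profanity)            # removeWords expects a character vector
Corpus <- tm_map(Corpus, removeWords, profanity)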

I decided to point the source at the training dataset. To do that, I created a folder named "training" and placed the Train.txt files (blogs, news, twitter) in it. I also created a folder named "modified" to hold the in-process cleaning data.

Corpus <- PCorpus(DirSource("training", encoding = "UTF-8", mode = "text"),
                  dbControl = list(dbName="Corpus.db", dbType="DB1"))

Cleaning steps

Code to convert to lower case, separate hyphenated and slashed words, convert the <> symbol to an apostrophe, report progress to the user, and create end-of-sentence markers:

Corpus <- tm_map(Corpus, content_transformer(tolower)); dbInit("Corpus.db")
## Reorganizing database: 100% (1/1)
## Finished; reload database with 'dbInit'
## 'filehashDB1' database 'Corpus.db'
for(j in seq(Corpus)) {
  Corpus[[j]][[1]] <- gsub("-", " ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("/", " ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("<>", "\\'", Corpus[[j]][[1]])
  print("3 of 18 transformations complete")
  Corpus[[j]][[1]] <- gsub("\\. |\\.$","  <EOS> ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("\\? |\\?$","  <EOS> ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("\\! |\\!$","  <EOS> ", Corpus[[j]][[1]])
  print("6 of 18 transformations complete") 
}
## [1] "3 of 18 transformations complete"
## [1] "6 of 18 transformations complete"

Code to write the corpus to permanent disk storage:

write(Corpus[[1]][[1]], "./modified/CorpusTrain.txt")

Code to read the corpus back in, transform various ASCII codes into the appropriate characters, remove all punctuation except apostrophes and the <> symbols, remove website URLs, and remove all single letters except "a" and "i":

Corpus <- PCorpus(DirSource("modified", encoding = "UTF-8", mode = "text"),
                  dbControl = list(dbName="halfCorpus.db", dbType="DB1"))

for(j in seq(Corpus)) {
  Corpus[[j]][[1]] <- gsub("<85>"," <EOS> ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("<92>","'", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("\\&", " and ", Corpus[[j]][[1]])
  print("9 of 18 transformations complete")
  Corpus[[j]][[1]] <- gsub("[^[:alnum:][:space:]\'<>]", " ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub(" www(.+) ", " ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub(" [b-hj-z] "," ", Corpus[[j]][[1]])
  print("12 of 18 transformations complete")
}
## [1] "9 of 18 transformations complete"
## [1] "12 of 18 transformations complete"
write(Corpus[[1]][[1]], "./modified/CorpusTrain.txt")

Code to remove stray apostrophes introduced by the transformations, remove errant codes in < > brackets, replace numbers with a <NUM> marker for context, and remove the errant <> brackets that remain:

Corpus <- PCorpus(DirSource("modified", encoding="UTF-8", mode = "text"), dbControl = list(dbName="lastCorpus.db", dbType="DB1"))

for(j in seq(Corpus)) {
  Corpus[[j]][[1]] <- gsub(" ' "," ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("\\' ", " ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub(" ' ", " ", Corpus[[j]][[1]])
  print("15 of 18 transformations complete")
  Corpus[[j]][[1]] <- gsub("<[^EOS].+>"," ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("[0-9]+"," <NUM> ", Corpus[[j]][[1]])
  Corpus[[j]][[1]] <- gsub("<>"," ", Corpus[[j]][[1]])
  print("18 of 18 transformations complete") 
}
## [1] "15 of 18 transformations complete"
## [1] "18 of 18 transformations complete"

Code to remove numbers, followed by the dbInit function from the 'filehash' package, which reorganizes and reloads the on-disk database to keep memory use down:

Corpus <- tm_map(Corpus, removeNumbers); dbInit("lastCorpus.db")
## Reorganizing database: 100% (1/1)
## Finished; reload database with 'dbInit'
## 'filehashDB1' database 'lastCorpus.db'

Code to remove errant 's tokens that are not contractions, closing brackets at the start of a word, and extra white space such as line breaks:

Corpus[[1]][[1]] <- gsub(" 's"," ", Corpus[[1]][[1]])
Corpus[[1]][[1]] <- gsub(">[a-z]"," ", Corpus[[1]][[1]])

Corpus <- tm_map(Corpus, stripWhitespace); dbInit("lastCorpus.db") 
## Reorganizing database: 100% (1/1)
## Finished; reload database with 'dbInit'
## 'filehashDB1' database 'lastCorpus.db'

Code to write the final, processed corpus to disk for building the n-grams:

write(Corpus[[1]][[1]], "./modified/CorpusTrain.txt")

One-Gram Model

The code below uses CorpusTrain.txt to generate a list of all 1-grams (unigrams). The 'tau' library is used to build the n-grams.

Corpus <- PCorpus(DirSource("modified", encoding="UTF-8", mode = "text"), dbControl = list(dbName="aggCorpus.db", dbType="DB1"))

I also defined an object that pulls the text element out of the Corpus list:

CORP <- c(Corpus[[1]][[1]])

I opted to create an n-gram function instead of using a built-in one, such as NGramTokenizer from the RWeka package.

# builds n-grams of order n from the global character vector CORP using tau::textcnt
n.gram <- function(n) {
  textcnt(CORP, method = "string", n = as.integer(n),
          split = "[[:space:][:digit:]]+", decreasing = T)
}

Code to build the one-gram model:

one.gram <- n.gram(1)
one.gram.DF <- data.frame(Uni = names(one.gram), counts = unclass(one.gram))
rm(one.gram)
one.gram.DF$Uni <- as.character(one.gram.DF$Uni)
one.gram.DF$counts <- as.numeric(one.gram.DF$counts)

Code to remove the tokens "<eos>" and "<num>" from the one.gram data frame:

one.gram.DF <- one.gram.DF[which(one.gram.DF$Uni !="<eos>"),]
one.gram.DF <- one.gram.DF[which(one.gram.DF$Uni !="<num>"),]

Some exploratory analysis:

length(one.gram.DF$Uni)
## [1] 240700

In order to build a predictive model, the strategy selected was to build an N-Gram model augmented with Good-Turing Smoothing methods. Good-Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. It was developed by Alan Turing and his assistant Irving John Good as part of their efforts at Bletchley Park to crack German ciphers for the Enigma machine during World War II. Turing at first modeled the frequencies as a multinomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator’s accuracy.

Termed Good-Turing discounting (also known as Good-Turing smoothing), the technique re-estimates the probability mass assigned to N-Grams with zero or low counts by discounting the counts of N-Grams that occur more often.

Code to build the frequency-of-frequency table for Good-Turing smoothing:

one.freq.t <- data.frame(Uni=table(one.gram.DF$counts))
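
To make the use of this table concrete, here is a toy sketch (not the project data) of the classic Good-Turing adjusted count, c* = (c + 1) * N(c+1) / N(c), and of the probability mass reserved for unseen n-grams:

# Toy sketch: Good-Turing adjusted counts from a frequency-of-frequency table.
# In a real corpus N_c decreases as c grows; this tiny example only shows the mechanics.
counts <- c(the = 5, cat = 3, sat = 3, on = 2, mat = 1, a = 1, hat = 1)

freq.of.freq <- table(counts)              # N_c: number of types seen exactly c times
c.vals <- as.numeric(names(freq.of.freq))
Nc     <- as.numeric(freq.of.freq)

# c* = (c + 1) * N_{c+1} / N_c ; counts with no observed N_{c+1} are left as-is
gt.adjust <- function(c) {
  n.c1 <- Nc[c.vals == c + 1]
  if (length(n.c1) == 0) return(c)
  (c + 1) * n.c1 / Nc[c.vals == c]
}
data.frame(c = c.vals, N_c = Nc, c.star = sapply(c.vals, gt.adjust))

# probability mass reserved for unseen n-grams: N_1 / total tokens
Nc[c.vals == 1] / sum(counts)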

Code to write the results to CSV files to speed up processing later:

write.csv(one.gram.DF, "one.gram.DF.csv")
write.csv(one.freq.t, "one.freq.t.csv")
rm(one.gram.DF, one.freq.t, CORP, Corpus)

Two-Gram Model

Code to build the two-gram model. Here I reload the corpus in order to define a new database for the two-gram step:

Corpus <- PCorpus(DirSource("modified", encoding="UTF-8", mode = "text"), dbControl = list(dbName="twogramCorpus.db", dbType="DB1"))

CORP <- c(Corpus[[1]][[1]])
rm(Corpus)

Code to set the number of loop runs needed to process the corpus 10,000 documents at a time:

step <- trunc(length(CORP)/10000)       # number of full 10,000-line chunks
remain <- length(CORP)-(step * 10000)   # lines left over after the full chunks
CORPport <- CORP[1:remain]              # first portion to process

The two-gram model:

two.gram <- n.gram(2)
names(two.gram) <- gsub("^\'","", names(two.gram))
two.gram.df <- data.frame(Bi = names(two.gram), counts = unclass(two.gram))
names(two.gram.df) <- c("Bi", "counts")

Code to remove the tokens "<eos>" and "<num>" from the two-gram data frame:

eost <- grepl("<eos>", two.gram.df$Bi)
two.gram.df <- two.gram.df[!eost,]
numt <- grepl("<num>", two.gram.df$Bi)
two.gram.df <- two.gram.df[!numt,]

Code to write this step's two-gram data frame to disk:

write.csv(two.gram.df, "two.gram.def.csv")

Code to build the frequency-of-frequency table for Good-Turing smoothing and write the results to CSV:

two.freq.t <- data.frame(Bi=table(two.gram.df$counts))
write.csv(two.gram.df, "two.gram.df.csv")
write.csv(two.freq.t, "two.freq.t.csv")
rm(temp.two.gram.df, numt, eost, CORPport, two.gram)
## Warning in rm(temp.two.gram.df, numt, eost, CORPport, two.gram): object
## 'temp.two.gram.df' not found
rm(remain, step, name, i, CORP)
## Warning in rm(remain, step, name, i, CORP): object 'name' not found
## Warning in rm(remain, step, name, i, CORP): object 'i' not found
rm(two.gram.df, two.freq.t)

Frequency distribution of each n-gram

Code to plot the N-Gram distributions:

one.gram.DF <- read.csv("one.gram.DF.csv")
two.gram.df <- read.csv("two.gram.df.csv")
par(mfrow=c(1,2))
dist.one <- round(0.5*dim(one.gram.DF)[[1]])
dist.two <- round(0.5*dim(two.gram.df)[[1]])

plot(log10(one.gram.DF$counts[1:dist.one]), ylab = "log10 (Frequency)", xlab = "Top 50% of One-Grams", col="darkslateblue", ylim = c(.00001,6))

plot(log10(two.gram.df$counts[1:dist.two]), ylab = "log10 (Frequency)", xlab = "Top 50% of Two-Grams", col="darkslateblue", ylim = c(.00001,6))

Code to plot the frequencies of frequencies:

par(mfrow = c(1,2))
one.freq.t <- read.csv("one.freq.t.csv")
two.freq.t <- read.csv("two.freq.t.csv")

scatter.smooth(log10(one.freq.t$Uni.Var), log10(one.freq.t$Uni.Freq), ylab = "log10 (frequency of frequency)", xlab = "log10 (one-gram count)", col=alpha("black",0.1),ylim=c(.000001,7),xlim=c(.000001,6))

scatter.smooth(log10(two.freq.t$Bi.Var), log10(two.freq.t$Bi.Freq), ylab = "log10 (frequency of frequency)", xlab = "log10 (two-gram count)", col=alpha("black",0.1),ylim=c(.000001,7),xlim=c(.000001,6))

Next steps

Build a three-gram model: Due to the deadline and memory problems allocating large vectors, I decided to leave the three-gram model out of this report. One of the challenges ahead is to overcome this issue and build that model so that an N-Gram predictive app can follow; a sketch of the intended extension is shown below.
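
The sketch below simply extends the existing pipeline to trigrams, reusing the n.gram() helper defined above; it is not run in this report, and it assumes the cleaned corpus (CORP) has been reloaded as in the One-Gram Model section:

# Sketch only (not run): the one-/two-gram pipeline extended to trigrams
three.gram <- n.gram(3)
three.gram.DF <- data.frame(Tri = names(three.gram), counts = unclass(three.gram))
three.gram.DF <- three.gram.DF[!grepl("<eos>|<num>", three.gram.DF$Tri), ]
three.freq.t <- data.frame(Tri = table(three.gram.DF$counts))
write.csv(three.gram.DF, "three.gram.DF.csv")
write.csv(three.freq.t, "three.freq.t.csv")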

Build an N-Gram predictive app: The general strategy is to (1) use counts based on Good-Turing discounting to calculate N-Gram probabilities via N-Gram clustering, where a cluster is a group of words (sometimes a large number) having the same likelihood of appearing in the prediction model; and (2) use Katz back-off (also known as back-off N-Gram modeling) to estimate the conditional probability of a word given its history in the N-Gram. Developed by Slava M. Katz in 1987, it predicts first based on non-zero, higher-order N-Grams and backs off to a lower-order N-Gram when there is no evidence for the higher-order N-Gram.
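
As a very rough preview (my sketch, not the final algorithm), the back-off skeleton could look like the code below; it uses only raw counts, with the Good-Turing discounted probabilities still to be plugged in, and assumes the data frames written to CSV earlier in this report:

# Sketch only: back-off skeleton (no discounting applied yet)
one.gram.DF <- read.csv("one.gram.DF.csv", stringsAsFactors = FALSE)
two.gram.df <- read.csv("two.gram.df.csv", stringsAsFactors = FALSE)

predict.next <- function(prev.word, k = 3) {
  # higher-order evidence: bigrams that start with the previous word
  hits <- two.gram.df[grepl(paste0("^", prev.word, " "), two.gram.df$Bi), ]
  if (nrow(hits) > 0) {
    completions <- sub(paste0("^", prev.word, " "), "", hits$Bi)
    return(head(completions[order(-hits$counts)], k))
  }
  # no evidence at the bigram level: back off to the unigram model
  head(one.gram.DF$Uni[order(-one.gram.DF$counts)], k)
}

predict.next("thank")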

Create an interactive Shiny app: This interactive web app will take text input and return the predicted upcoming terms. I am also considering allowing exploration of the predicted terms through graphs and tables.