This milestone report presents the initial exploratory analysis of the data set on which a SwiftKey-style next-word prediction algorithm will eventually be built. The HC Corpora is provided by SwiftKey in the context of the Coursera Data Science Capstone. It consists of sets of text files in four languages (German (de), English (en), Finnish (fi) and Russian (ru)); each set contains three text files from three different sources (news, blogs and Twitter). The following analysis and the final algorithm focus on the English dataset.
The three text files (en_US.twitter.txt, en_US.news.txt and en_US.blogs.txt) are downloaded. The files are quite large (159 MB to 200 MB) for a PC to handle in memory.
## en_US.twitter.txt en_US.news.txt en_US.blogs.txt
## "159.36 MB" "196.28 MB" "200.42 MB"
Before exploring the data and fitting models, it is necessary to partition the data into three parts: a training, a validation and a test dataset. The model will be fitted by choosing parameters that minimize error on the training set, tuned by minimizing error on the validation set, and finally evaluated on the test set.
The HC Corpora dataset will be split 60/20/20 into training/validation/test datasets.
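The split is performed by a helper function split.file(), whose implementation is not reproduced in this report. A minimal sketch of one possible implementation, assuming each line is independently and randomly assigned to a partition and written to a new file:
# Sketch of a possible split.file() (the actual implementation is not shown):
# randomly assigns each line of a text file to the training / validation / test
# partition according to the supplied proportions and writes three output files.
split.file <- function(file.name, splits) {
  lines <- readLines(file.name, encoding = "UTF-8", skipNul = TRUE)
  print(paste("Found", length(lines), "lines in file", file.name))
  set.seed(1234)  # assumption: fixed seed for reproducibility
  group <- sample(c("train", "validation", "test"),
                  size = length(lines), replace = TRUE, prob = splits)
  writeLines(lines[group == "train"],      paste0("train.", file.name))
  writeLines(lines[group == "validation"], paste0("validation.", file.name))
  writeLines(lines[group == "test"],       paste0("test.", file.name))
}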
twitter.file.name <- 'en_US.twitter.txt'
news.file.name <- 'en_US.news.txt'
blogs.file.name <- 'en_US.blogs.txt'
splits <- c(0.6, 0.2, 0.2)
split.file(twitter.file.name, splits)
## [1] "Found 2360148 lines. in file en_US.twitter.txt"
## [1] "Dataset is splitted into training, testing, and cross validation set:"
## [1] "1417230 lines in training set."
## [1] "471078 lines in cross validation set."
## [1] "471840 lines in test set."
split.file(news.file.name, splits)
## [1] "Found 77259 lines. in file en_US.news.txt"
## [1] "Dataset is splitted into training, testing, and cross validation set:"
## [1] "46525 lines in training set."
## [1] "15375 lines in cross validation set."
## [1] "15359 lines in test set."
split.file(blogs.file.name, splits)
## [1] "Found 899288 lines. in file en_US.blogs.txt"
## [1] "Dataset is splitted into training, testing, and cross validation set:"
## [1] "538741 lines in training set."
## [1] "180126 lines in cross validation set."
## [1] "180421 lines in test set."
It is not necessary to load the entire dataset to fit the algorithms. Just as a sample can be used to infer facts about a population, it is preferable to work with a smaller subset of the data.
In the section below, randomly selected rows are included to approximate the results that would be obtained using all the data.
Firstly, text is read using R's readLines function.
Secondly, for exploratory purposes, a random sample of 3% is taken from each file of the training dataset: each line of the training files is kept or dropped according to a binomial draw (0 or 1) with a 3% "success" rate.
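The sampling code itself is not shown in this report; a minimal sketch of how the three samples might be drawn with rbinom() (the helper name sample.lines and the train.* file names are hypothetical, and the whole partition is read at once here for brevity):
# Sketch: keep each line of a training file with probability 0.03,
# using a binomial (0/1) draw per line.
sample.lines <- function(file.name, rate = 0.03) {
  lines <- readLines(file.name, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(lines), size = 1, prob = rate) == 1
  lines[keep]
}
sample.train.twitter <- sample.lines("train.en_US.twitter.txt")
sample.train.news    <- sample.lines("train.en_US.news.txt")
sample.train.blogs   <- sample.lines("train.en_US.blogs.txt")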
Here we create a corpus consisting of the three sampled text files.
setwd("C:\\Users\\AngelT\\Documents\\Howard\\Capstone\\corpus\\final\\en_US\\sample")
sample.corpus <- c(sample.train.blogs,sample.train.news,sample.train.twitter)
sample.corpus.list<-list(sample.corpus)
my.corpus <- Corpus(VectorSource(sample.corpus.list))
Then we clean the data in the corpus.
Often in text analysis, "stop words" are removed because they are not useful for analysing the meaning and content of the text. Stop words are extremely common words such as "the", "of", "to", etc.
However, I shall keep these stop words for two reasons: (1) the purpose of the algorithm is next-word prediction rather than meaning analysis, and since stop words come up frequently in users' typing they must be kept; (2) stop words are an essential part of common phrases, so removing them would cause us to miss most of the phrases users type.
This will be illustrated in the n-gram analysis section.
# Custom transformations: strip Twitter hashtags and split decimal numbers
remove.hashtags <- function(x) {gsub("#[A-Za-z0-9]+", "", x)}
remove.decimals <- function(x) {gsub("([0-9]*)\\.([0-9]+)", "\\1 \\2", x)}
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
# Apply the custom transformations before stripping punctuation and digits,
# otherwise the "#" and "." they look for would already have been removed
my.corpus <- tm_map(my.corpus, content_transformer(remove.decimals))
my.corpus <- tm_map(my.corpus, content_transformer(remove.hashtags))
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeNumbers)
Profanity is filtered out, as these are words our app should not predict. A so-called Google bad-words list is used as the profanity list. Please note that Google has neither released an official list nor given any free license for any site to release one, so this is simply a list of bad words that many developers and coders often use.
# Read the bad-words list and remove those words from the corpus
profanity <- read.delim("profanity.txt", sep = ":", header = FALSE)
profanity <- profanity[, 1]
my.corpus <- tm_map(my.corpus, removeWords, profanity)
Remove excess white space.
my.corpus <- tm_map(my.corpus, stripWhitespace)
Here we identify appropriate tokens, such as words and phrases, for analysis in the next stage.
Uni-gram analysis shows us the most frequently used words and their occurrence frequencies. Here I use the Ngram_Tokenizer that Maciej Szymkiewicz kindly made public.
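Ngram_Tokenizer.R is not reproduced here; conceptually, ngram_tokenizer(n) returns a function that splits text into overlapping n-word tokens. A rough sketch of such a factory (an illustration only, not Maciej Szymkiewicz's original code):
# Rough sketch of an n-gram tokenizer factory: returns a function that
# lower-cases its input, splits it on non-word characters and pastes every
# run of n consecutive words into a single token.
ngram_tokenizer_sketch <- function(n) {
  function(text) {
    words <- unlist(strsplit(tolower(paste(text, collapse = " ")), "[^a-z0-9']+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }
}
ngram_tokenizer_sketch(2)("thanks for the follow")   # "thanks for" "for the" "the follow"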
library(ggplot2)
source("Ngram_Tokenizer.R")
unigram.tokenizer <- ngram_tokenizer(1)
wordlist <- unigram.tokenizer(my.corpus)
unigram.df <- data.frame(V1=as.vector(names(table(unlist(wordlist)))),
V2 = as.numeric(table(unlist(wordlist))))
names(unigram.df) <- c("word","freq")
unigram.df <- unigram.df[with(unigram.df, order(-unigram.df$freq)),]
row.names(unigram.df) <- NULL
save(unigram.df, file="unigram.Rda")
ggplot(head(unigram.df,50), aes(x=reorder(word,-freq), y=freq)) +
geom_bar(stat="Identity", fill="blue") +
geom_text(aes(label=freq), vjust = -1, size=2, angle=0) +
ggtitle("Uni-grams frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text=element_text(angle = 90, size=9))
Here we estimate how many unique words are needed to cover 50% and 90% of all word instances in the language.
In the Brown Corpus of American English text, consisting of over one million words, half of the word volume consists of repeated uses of only 135 words [1]. The word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). The second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852).
A similar pattern is expected in our sampled training dataset. Based on the uni-gram data frame, the following code adds up word frequencies in descending order until the accumulated frequency reaches 50% and then 90% of the total number of word instances.
total.words <- sum(unigram.df$freq)   # total number of word instances in the sample
t1 <- 0.5 * total.words               # 50% coverage target
t2 <- 0.9 * total.words               # 90% coverage target
c <- 0
i <- 1
# Accumulate frequencies of the most frequent words until 50% coverage is reached
while (c < t1) {
  c <- c + unigram.df[i, ]$freq
  i <- i + 1
}
r1 <- i
r1
## [1] 121
# Continue accumulating until 90% coverage is reached
while (c < t2) {
  c <- c + unigram.df[i, ]$freq
  i <- i + 1
}
r2 <- i
r2
## [1] 6799
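An equivalent, vectorised way to find the same coverage points is a cumulative sum over the frequency-sorted uni-grams (a sketch; note it returns the number of words needed, which can be one less than the loop indices above because i is incremented once more after the threshold is crossed):
# Sketch: cumulative coverage of the frequency-sorted uni-grams
coverage <- cumsum(unigram.df$freq) / sum(unigram.df$freq)
words.for.50 <- which(coverage >= 0.5)[1]   # unique words needed for 50% coverage
words.for.90 <- which(coverage >= 0.9)[1]   # unique words needed for 90% coverage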
Bi-gram analysis shows us the most frequently used two-word phrases or combinations and their occurrence frequencies.
bigram.tokenizer <- ngram_tokenizer(2)
wordlist <- bigram.tokenizer(my.corpus)
bigram.df <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(bigram.df) <- c("word","freq")
bigram.df <- bigram.df[with(bigram.df, order(-bigram.df$freq)),]
row.names(bigram.df) <- NULL
save(bigram.df, file="bigram.Rda")
ggplot(head(bigram.df,50), aes(x=reorder(word,-freq), y=freq)) +
geom_bar(stat="Identity", fill="blue") +
geom_text(aes(label=freq), vjust = -1, size=2, angle=0) +
ggtitle("Bi-grams frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text=element_text(angle = 90, size=9))
Tri-gram analysis shows us the most frequently used three-word phrases or combinations and their occurrence frequencies.
trigram.tokenizer <- ngram_tokenizer(3)
wordlist <- trigram.tokenizer(my.corpus)
trigram.df <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(trigram.df) <- c("word","freq")
trigram.df <- trigram.df[with(trigram.df, order(-trigram.df$freq)),]
row.names(trigram.df) <- NULL
save(trigram.df, file="trigram.Rda")
ggplot(head(trigram.df,50), aes(x=reorder(word,-freq), y=freq)) +
geom_bar(stat="Identity", fill="blue") +
geom_text(aes(label=freq), vjust = -1, size=2, angle=0) +
ggtitle("Tri-grams frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text=element_text(angle = 90, size=9))
Quadri-gram analysis shows us the most frequently used four-word phrases or combinations and their occurrence frequencies.
quadrigram.tokenizer <- ngram_tokenizer(4)
wordlist <- quadrigram.tokenizer(my.corpus)
quadrigram.df <- data.frame(V1 = as.vector(names(table(unlist(wordlist)))), V2 = as.numeric(table(unlist(wordlist))))
names(quadrigram.df) <- c("word","freq")
quadrigram.df <-quadrigram.df[with(quadrigram.df, order(-quadrigram.df$freq)),]
row.names(quadrigram.df) <- NULL
save(quadrigram.df, file="quadrigram.Rda")
ggplot(head(quadrigram.df,50), aes(x=reorder(word,-freq), y=freq)) +
geom_bar(stat="Identity", fill="blue") +
geom_text(aes(label=freq), vjust = -1, size=2, angle=0) +
ggtitle("Quadri-gram frequency") +
ylab("Frequency") +
xlab("Term")+
theme(axis.text=element_text(angle = 90, size=9))
Memory usage after the n-gram analysis, as reported by gc():
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 32151668 1717.1 64424022 3440.7 64424022 3440.7
## Vcells 256134493 1954.2 573244120 4373.6 570496644 4352.6
Factors to be considered:
Lemmatisation algorithms (e.g. applying different normalisation rules for various parts of speech).
With a list of all 3-grams and their occurrence probabilities, the final app should be able to filter all 3-grams starting with the two query words. For example, "thanks for" will most likely be followed by "the", according to the tri-gram analysis above.
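A minimal sketch of such a lookup against the tri-gram data frame built above (predict.next.word() is a hypothetical helper, not part of this report's code):
# Sketch: given two query words, find tri-grams that start with them and
# return the most frequent continuations.
predict.next.word <- function(w1, w2, n = 3) {
  grams <- as.character(trigram.df$word)      # tri-grams stored as "w1 w2 w3"
  prefix <- paste(w1, w2, "")                 # e.g. "thanks for " (trailing space)
  matches <- trigram.df[startsWith(grams, prefix), ]
  if (nrow(matches) == 0) return(character(0))
  top <- head(matches[order(-matches$freq), ], n)
  # the predicted word is the last token of each matching tri-gram
  sapply(strsplit(as.character(top$word), " "), tail, 1)
}
predict.next.word("thanks", "for")   # "the" is expected among the top candidates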