1 Introduction

The goal of the capstone project is to create a predictive text model using a large text corpus of documents as a training dataset. The predictive model will be embedded into a Shiny application. To reach this goal, natural language processing techniques will be applied.

This report describes the main steps of the exploratory data analysis and summarizes the major features identified in the dataset. Finally, the plan for creating the predictive model and the related Shiny application is presented.

2 Getting the Text Corpus

The training dataset comes from a text corpus called HC Corpora and is provided by SwiftKey. The corpus is generated by a web crawler that browses publicly available sources. The crawler is set to check for one specific language at a time. The text is parsed to remove duplicate entries and to split lines. To ensure anonymization, approximately 50% of each entry is deleted.

## Load required packages
library(readr)       ## read_lines(), write_lines()
library(stringi)     ## stri_stats_general(), stri_count_words()
library(knitr)       ## kable()
library(kableExtra)  ## kable_styling(), add_header_above()
library(tm)          ## corpus handling and cleaning (also attaches NLP for words() and ngrams())
library(ggplot2)     ## frequency histograms
library(gridExtra)   ## grid.arrange()
library(wordcloud)   ## word cloud
library(pals)        ## alphabet() colour palette
library(magrittr)    ## pipe operator %>%

## Set variables
main_dir <- "./data"
dest_file <- paste0(main_dir, "/Coursera-SwiftKey.zip")
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

## Create directory if not existing
if(!file.exists(main_dir)) { dir.create(main_dir) }

## Download HC Corpora corpus in directory if not already present
if(!file.exists(dest_file)) { download.file(data_url, destfile = dest_file, mode = "wb") }

## Unzip HC Corpora corpus if not already done
if(!file.exists(paste0(main_dir, "/final"))) { unzip(zipfile = dest_file, exdir = main_dir) }

The corpus contains files for 4 different languages (English, German, Finnish and Russian).

For each language, the corpus is generated from 3 sources, each stored in a specific file:

  • Blog posts
  • News articles
  • Twitter messages

For this project, the focus will be on the English corpus.

## Path for each English file
blog_file <- "./data/final/en_US/en_US.blogs.txt"
news_file <- "./data/final/en_US/en_US.news.txt"
twitter_file <- "./data/final/en_US/en_US.twitter.txt"

## Read each file (binary mode avoids truncated reads caused by embedded control characters)
blog <- file(blog_file, "rb") %>% read_lines(n_max = -1L, progress = F)
news <- file(news_file, "rb") %>% read_lines(n_max = -1L, progress = F)
twitter <- file(twitter_file, "rb") %>% read_lines(n_max = -1L, progress = F)

## Get size of each file
files_size <- sapply(list(BLOG=blog_file, NEWS=news_file, TWITTER=twitter_file), file.size)

## Get general information for each file
fileInfo <- t(sapply(list(BLOG=blog, NEWS=news, TWITTER=twitter), stri_stats_general))[, c("Lines", "Chars")]

## Generate table for basic statistics
knitr::kable(cbind(data.frame(files_size), fileInfo), row.names = TRUE, 
             caption = "Basic statistics for each file in the English corpus.",
             col.names = c("File Size (bytes)", "Number of lines", "Number of characters"), 
             format.args = list(decimal.mark = '.', big.mark = " ")) %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Table 2.1: Basic statistics for each file in the English corpus.

            File Size (bytes)   Number of lines   Number of characters
  BLOG            210 160 014           899 288            206 824 382
  NEWS            205 811 889         1 010 242            203 223 154
  TWITTER         167 105 338         2 360 148            162 096 031

The basic characteristics of each file are shown in Table 2.1. When looking at the number of lines, each file is quite large, and this may pose some challenges for the exploratory analysis and the following steps. The Twitter file is the largest in terms of number of lines but the smallest in terms of number of characters. This last point is consistent with the 280-character limit imposed by Twitter.

## Get word statistics for each file
word_info <- t(sapply(sapply(list(BLOG=blog, NEWS=news, TWITTER=twitter), 
                          stri_count_words), summary))

## Generate table for word statistics
knitr::kable(word_info, row.names = TRUE,
        caption = "Basic word statistics for each file in the English corpus.") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
   add_header_above(c("", "Word per line" = 6)) 
Table 2.2: Basic word statistics (words per line) for each file in the English corpus.

            Min.   1st Qu.   Median       Mean   3rd Qu.   Max.
  BLOG         0         9       28   41.75107        60   6726
  NEWS         1        19       32   34.40997        46   1796
  TWITTER      1         7       12   12.75063        18     47

The statistics for words per line in each file are shown in Table 2.2. The Blog file has the widest range, with one line containing 6726 words. Consistent with the 280-character limit, the Twitter file has the lowest mean and the smallest spread.

3 Exploratory Data Analysis

Considering the size of the corpus, a smaller subsample (5% of the original dataset) is generated by random sampling and used for the exploratory analysis.

## Ensure reproducibility
set.seed(21141)

## Function that creates a sample file using random sampling
prepareAndSaveSubset <- function(file_data, name) {
    file_split <- sample(seq(1, 2), size = length(file_data), replace = TRUE, prob = c(.05, .95))
    write_lines(file_data[file_split == 1],  path=paste0("./data/final/en_US_sample/en_US.", name, ".txt"))
}

## Create directory and save subset
if(!file.exists("./data/final/en_US_sample")){ dir.create("./data/final/en_US_sample") }
prepareAndSaveSubset(blog, "blogs")
prepareAndSaveSubset(news, "news")
prepareAndSaveSubset(twitter, "twitter")

3.1 Cleaning the Text Corpus

Cleaning data is an important step, as predictive algorithms need standardized inputs. Cleaning removes the less useful parts of the text. For this project, the cleaning steps consist of:

  • Capitalization removal
  • Number removal
  • Hashtag removal
  • Profanity words removal (using a list of words created by Robert J Gabriel)
  • Extra white space removal
  • Special character and punctuation removal

Special characters and punctuation, such as commas and parentheses, are removed under the assumption that they have little impact on word order.

Stopwords (words like “the” or “and”) are kept, as they are valuable in the context of this project.

# Create corpus using the sample dataset and remove non-ASCII characters
corpus <- VCorpus(DirSource("./data/final/en_US_sample/"))  
corpus <- VCorpus(VectorSource(sapply(corpus, function(x) {iconv(x, "latin1", "ASCII", sub="")}))) 

# List of profanity words
download.file("https://raw.github.com/RobertJGabriel/Google-profanity-words/master/list.txt", 
              destfile = "./data/profanity_words.txt", mode = "wb")
profanity_words <- read_lines("./data/profanity_words.txt", progress = FALSE)

# Function to remove special characters
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

# Function to clean corpus: lower-case the text, then remove special characters, numbers,
# profanity, punctuation and extra whitespace
cleaningData <- function(sample) {
    return(tm_map(sample, content_transformer(tolower)) %>%
            tm_map(content_transformer(removeSpecialChars)) %>% tm_map(removeNumbers) %>%
            tm_map(removeWords, profanity_words) %>% tm_map(removePunctuation) %>%
            tm_map(stripWhitespace)) }

# Clean corpus
corpus <- cleaningData(corpus)

3.2 Tokenization and n-gram creation

Tokenization is the task of chopping text up into pieces: from raw text into sentences and words. From those words, n-grams can be computed. N-grams are contiguous sequences of n words. For this report, the corpus is cut into words, and n-grams formed of 1 to 4 words are analyzed.

## Function to create n-grams and retain only the most frequent
calculateWordFrequency <- function(sample, nbrGram) {
  tkn <- function(x) unlist(lapply(ngrams(words(x), nbrGram), paste, collapse = "_"), use.names = FALSE)
  data_freq <- TermDocumentMatrix(sample, control=list(tokenize=tkn)) %>% as.matrix(na.rm=TRUE) %>% rowSums()
  data_freq <- data_freq[order(-data_freq)]
  return(data.frame(word = names(data_freq)[1:100], freq=data_freq[1:100])) }

## Create n-grams
unigram_freq <- calculateWordFrequency(corpus, 1)
bigram_freq <- calculateWordFrequency(corpus, 2)
trigram_freq <- calculateWordFrequency(corpus, 3)
tetragram_freq <- calculateWordFrequency(corpus, 4)

3.3 Word Cloud for the unigrams

A word cloud is created for the unigrams. The size of each word represents how often it appears in the sample.

## Create a word cloud for unigram
wordcloud(words = unigram_freq$word, scale=c(8,.8), freq = unigram_freq$freq, min.freq = 1,
          max.words=50, random.order=FALSE, rot.per=0.45, colors=alphabet())

Figure 3.1: Word cloud for the unigrams.

For the unigrams (Fig. 3.1), the word “the” is the most frequent word. Its frequency is almost double that of the second most frequent word.
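
The exact counts behind this observation can be checked directly from the unigram frequency table computed above, for example:

## Inspect the most frequent unigrams and their counts
head(unigram_freq, 3)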

3.4 N-Grams Frequency

The frequency has been calculated for n-grams formed of 2 to 4 words, and the most frequent n-grams are shown in Figure 3.2.

## Function to generate the histogram
createHist <- function(freq, title) {
    p <- ggplot(freq[1:20,], aes(x=reorder(word, freq), y=freq)) + geom_bar(stat='identity', fill="orange") +
           ggtitle(title) + coord_flip() + theme(axis.title.x=element_blank(), 
           axis.title.y=element_blank(), title=element_text(size=8, face = "bold", colour = c("darkblue")),
           axis.text.x=element_text(angle = 70, hjust = 1), plot.title = element_text(hjust = 0.5))
    return(p) }

grid.arrange(createHist(bigram_freq, "Most Frequent Bigrams"), createHist(trigram_freq, 
   "Most Frequent Trigrams"), createHist(tetragram_freq, "Most Frequent Tetragrams"), nrow = 1)

Figure 3.2: Most frequent words in n-grams formed of 2 to 4 words.

Not surprisingly, “the” is also present in all the most frequent n-grams. Stopwords are highly represented, as are conjugated forms of the verbs “to be” and “to have”.

4 Prediction Approach

The next step of this capstone project will be to develop an efficient predictive text algorithm. Once the algorithm is ready, it will be embedded into a Shiny application.

The predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above. One possible strategy would be to use the trigram model to predict the next word. If no matching trigram can be found, the algorithm would back off to the bigram model, and then to the unigram model if needed. Algorithms to deal with unknown words should also be explored.
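
As an illustration, a minimal sketch of such a backoff lookup is given below. It reuses the n-gram frequency tables computed in the exploratory analysis (which here only keep the 100 most frequent n-grams; the final model would use complete tables), and the function name predictNextWord is purely illustrative, not the final algorithm.

## Hypothetical sketch of a frequency-based backoff lookup (illustrative only)
predictNextWord <- function(phrase, n_max = 4) {
    ## Split the input phrase into lower-case words
    tokens <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
    if (length(tokens) == 0 || tokens[1] == "") return(as.character(unigram_freq$word[1]))
    tables <- list(`2` = bigram_freq, `3` = trigram_freq, `4` = tetragram_freq)
    ## Try the longest usable n-gram first, then back off to shorter ones
    for (n in seq(min(n_max, length(tokens) + 1), 2, by = -1)) {
        prefix <- paste(tail(tokens, n - 1), collapse = "_")
        hits <- tables[[as.character(n)]]
        hits <- hits[grepl(paste0("^", prefix, "_"), hits$word), ]
        if (nrow(hits) > 0) {
            ## Return the last word of the most frequent matching n-gram
            return(sub(".*_", "", hits$word[which.max(hits$freq)]))
        }
    }
    ## Fall back to the most frequent unigram when nothing matches
    as.character(unigram_freq$word[1])
}

## Example call
predictNextWord("thank you for the")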

The user interface of the Shiny application will consist of a text input box that will allow a user to enter a phrase. The application will then use the predictive algorithm to suggest the most likely next word. The Shiny application should be optimized to be responsive and user-friendly.
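
A minimal sketch of such an interface, assuming a predictNextWord() function like the one sketched above, could look as follows (widget names and layout are illustrative only):

## Hypothetical sketch of the Shiny application (assumes predictNextWord() is available)
library(shiny)

ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("phrase", "Enter a phrase:"),
    h4("Suggested next word:"),
    textOutput("prediction")
)

server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        predictNextWord(input$phrase)
    })
}

shinyApp(ui = ui, server = server)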

5 References

Wikipedia contributors. (2018, April 11). N-gram. In Wikipedia, The Free Encyclopedia. Retrieved 00:35, September 1, 2018, from https://en.wikipedia.org/w/index.php?title=N-gram&oldid=835900923

Gabriel, R. J. Full List of Bad Words and Top Swear Words Banned by Google. Retrieved from https://github.com/RobertJGabriel/Google-profanity-words

Feinerer, I., Hornik, K. and Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1-54.