Natural Language Processing (NLP) is an important area of Artificial Intelligence, closely related to machine learning, that contributes to finding efficient ways for computers to communicate with humans and to learn from those interactions. One such contribution is presenting mobile users with predicted “next words” as they type in apps like WhatsApp, expediting message delivery by letting the user select a proposed word instead of typing it.
The purpose of this project is to expose students to the basics of NLP. The student will analyze and clean unstructured textual data from various sources (i.e., blogs, news, and tweets), prepare an n-gram model with the resulting tidy data, and develop a “next word” prediction model that can be used and tested as a Shiny app.
This report describes the progress achieved so far on the tasks of capturing the input data, analyzing and cleaning it, and using it to build an n-gram model. The n-gram model comprises three sets of n-grams (bigrams, trigrams, and quadgrams), a term frequency vector, and an ordered term vector. Both the n-grams and the vectors are saved to disk so that they can be used by the “next word” prediction model, whose development is the next step in this project.
The reason for creating bigrams, trigrams, and quadgrams is that project requirements state that “next word” prediction should take place using one, two, or three input words provided by the user.
Because models of this kind run on mobile devices with limited resources, rather than on desktop computers, runtime and resource usage (i.e., physical memory and storage space) must be carefully accounted for. To satisfy this constraint, an important initial effort was made to ensure that the n-gram model data is clean, precise, and small enough to handle a reasonable number of “next word” prediction situations.
To illustrate the analysis performed, the appendix contains summaries and plots of the data at its various transformation stages.
Task 1
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Task 2
Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
Exploratory analysis
The input files contained garbage text such as repeated-letter words (e.g., “aaaa”, “qqqqqqqq”).
The single-word Document Term Matrix (DTM) showed a high number of sparse terms.
Although the texts are mostly in English, they contain words from other languages, particularly German.
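As a brief illustration of these findings, the following R sketch uses a made-up vector of terms (not the actual Corpus) to show how repeated-letter garbage words and low-frequency sparse terms can be detected; all object names in it are illustrative only.
library(tm)

# Toy data for illustration; the real exploration ran over the Corpus documents
terms <- c("hello", "aaaa", "qqqqqqqq", "world", "zzz")

# Repeated-letter garbage words: one character followed by repetitions of itself
terms[grepl("^(.)\\1+$", terms, perl = TRUE)]

# A small DTM: terms with a total frequency of 1 are sparse-term candidates
dtm <- DocumentTermMatrix(Corpus(VectorSource(c("hello world hello", "world aaaa"))))
sort(colSums(as.matrix(dtm)))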
Understanding word frequencies
It was not possible to produce a DTM for each n-gram because R kept reporting errors, so bigram and trigram frequencies could not be obtained that way either. After some research, and trial-and-error attempts that were consuming too much time, the strategy was changed: each n-gram set is created with repetitions removed, and a term frequency vector is built for individual words only. Ultimately, this approach proved beneficial because it produced files small enough to be loaded and used quickly by the eventual “next word” prediction model.
Increasing coverage for terms not found in the corpora. An initial (learning) idea is to have the “next word” prediction model capture the missing word (as typed by the user) and report it to a backend system that integrates the resulting phrase into the Corpus. At a scheduled time, the backend system would process the Corpus again, this time with the new phrase included, and update mobile devices with a new version of the n-gram model.
Basic n-gram model
Patterns
FILTER_PATTERN_01: characters not matching “a” through “z” (lower and uppercase), digits ranging from 0 to 9, the blank space, and the single quote character (’).
FILTER_PATTERN_02: text pattern starting with a single quote, followed by any number of letters ranging from “a” through “z” (lower and uppercase), and then followed by a blank space.
FILTER_PATTERN_03: text pattern matching truncated forms of the verbs “to do”, “to be”, and “to have”.
FILTER_PATTERN_04: “stand-alone” numbers, such as 2000.
SINGLE_CHARACTER_PATTERN: terms composed of one character, excluding “a”, “i”, and “I”.
REPEAT_CHARACTER_PATTERN: text patterns composed of a repetition of a single character (e.g., “aaaaa”).
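To make the first two patterns concrete, the following sketch applies them to a made-up sentence; the regular expressions are the ones defined later in the Source code section, and the sample text is purely illustrative.
# FILTER_PATTERN_01 keeps letters, digits, blank spaces, and single quotes;
# FILTER_PATTERN_02 then strips the dangling contraction endings left behind
FILTER_PATTERN_01 <- "[^a-zA-Z 0-9']"
FILTER_PATTERN_02 <- "'[a-zA-Z]*(?!\\B|-)\\s"

x <- tolower("You'll see -- don't type #hashtags!")
x <- gsub(FILTER_PATTERN_01, " ", x, perl = TRUE)   # non-pattern characters become spaces
x <- gsub(FILTER_PATTERN_02, " ", x, perl = TRUE)   # "'ll", "'t", etc. are removed
x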
Process
Words are converted to lower case. Although capitalized proper names will be lost, far more of the text contains unnecessarily capitalized words.
All characters outside the set allowed by FILTER_PATTERN_01 are removed. The blank space is included in the pattern only so that it is preserved. The single quote, usually located between word parts (e.g., “don’t” and “they’ve”), is left in the texts so that these words can be processed later, once the rest of the non-pattern characters have been removed.
All word portions matching FILTER_PATTERN_02 are removed. A first attempt at keeping words like “don’t”, and “you’ll” was made and actually worked fine on the Corpus text files. However, when creating the n-grams the NGramTokenizer() function removed the single quote characters anyway, leaving such words as “don t”, and “you ll”. Thus, the resulting n-grams contained garbage text (e.g., bigrams like “you ll” and “ll see”).
Upon completion of steps 1 - 3, any remaining punctuation is completely removed.
All word portions matching FILTER_PATTERN_03 are removed. These are usually truncated forms of the verbs “to do”, “to be”, and “to have”.
FILTER_PATTERN_04: Numbers are removed only when they appear as single words (e.g., 2000). Numbers embedded within alphabetical characters are left (e.g., r2d2, c3po, u2) as they might be used in certain communication contexts.
Stop words are not removed as they may be necessary to complete a given phrase.
Profanity words are not removed (for now…), because removing them leaves gaps that make the resulting n-grams nonsensical.
All single-character words matching SINGLE_CHARACTER_PATTERN are removed (i.e., every single character except “a”, “i”, and “I”).
REPEAT_CHARACTER_PATTERN: All repeating character words, such as “zzz”, or “aaaaaaaaaa” are removed.
White space is stripped so that there is only one space between words.
Text samples. Since this project is to build a model (as opposed to a production application), the word prediction algorithm is tested with a subset of the data provided. Thus, text samples of TERM_SAMPLE_SIZE elements were obtained from each of the Corpus files.
New Corpus. The clean text sample is saved to disk, and used to create a new Corpus.
Term frequency vector. Created from the newly created Corpus. This way, both this vector and the n-grams share the same data.
Memory issues. To handle memory exhaustion and prolonged processing time (some runs took approximately 24 hours using the entire dataset), the aforementioned text samples were created. Additionally, data structures are deleted from memory as soon as they are no longer necessary, and the garbage collection function is executed immediately after.
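The steps above are condensed in the sketch below as a quick reference; the complete, runnable version (including all constants and the remaining pattern filters) appears in the Source code section.
# Condensed view of the cleaning and sampling pipeline; constants such as
# FILTER_PATTERN_01, EN_BLOG, and TERM_SAMPLE_SIZE are defined in the full listing
library(tm)

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))

docs_en <- tm_map(docs_en, content_transformer(tolower))    # step 1: lower case
docs_en <- tm_map(docs_en, toSpace, FILTER_PATTERN_01)      # step 2: keep letters, digits, space, quote
# ... steps 3 - 10: remaining pattern filters applied the same way ...
docs_en <- tm_map(docs_en, stripWhitespace)                 # step 11: collapse white space

set.seed(1234)                                              # step 12: reproducible samples
tsamples <- c(sample(docs_en[[EN_BLOG]]$content,    TERM_SAMPLE_SIZE),
              sample(docs_en[[EN_NEWS]]$content,    TERM_SAMPLE_SIZE),
              sample(docs_en[[EN_TWITTER]]$content, TERM_SAMPLE_SIZE))

rm(docs_en); gc()                                           # step 15: free memory promptly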
Model to handle unseen n-grams
The planned strategy for the “next word” prediction model is as follows (a sketch in R appears after the list):
If a three-word input does not yield a quadgram, the input is reduced by removing the first word, and n-gram search is resumed with a two-word input.
If a two-word input does not yield a trigram, the input is reduced by removing the first word, and n-gram search is resumed with a one-word input.
If a one-word input does not yield a bigram, the model returns an empty string.
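The following R sketch outlines this backoff; it assumes the three n-grams have already been loaded as character vectors of space-separated terms, and the function name and matching strategy are illustrative only, not the final prediction model.
# Rough outline of the planned backoff; bigram, trigram, and quadgram are assumed
# to be character vectors loaded from the files saved by the processing script
predict_next_word <- function(input, bigram, trigram, quadgram) {

  words  <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  ngrams <- list(quadgram, trigram, bigram)     # searched from longest to shortest
  sizes  <- c(3, 2, 1)                          # context words each n-gram type needs

  for (i in seq_along(sizes)) {
    n <- sizes[i]
    if (length(words) < n) next                 # not enough context for this n-gram type

    prefix  <- paste(tail(words, n), collapse = " ")
    matches <- grep(paste0("^", prefix, " "), ngrams[[i]], value = TRUE)

    if (length(matches) > 0)                    # return the word that follows the prefix
      return(tail(strsplit(matches[1], " ")[[1]], 1))

    words <- tail(words, n - 1)                 # back off: drop the first input word
  }

  ""                                            # no n-gram found: return an empty string
}
A production version would rank the candidate matches (e.g., by term frequency) instead of taking the first hit, but the backoff order shown is the one described above.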
NLP requires applying various pattern-discovery approaches aimed at eliminating noisy data.
R’s DTM processing over large datasets (over 100 MB in size) is rather inefficient, as runs may take hours to complete. This is the main reason smaller data samples were drawn from the Corpus.
Memory exhaustion issues are frequent. Care should be taken to remove data structures as soon as they are no longer needed and to free up memory by calling the garbage collection function.
Some functions (e.g., NGramTokenizer) create undesirable side effects, such as eliminating some types of characters that are intended to remain within the texts. To work around this problem, special purpose patterns and functions need to be created.
Word frequencies help discover sparse terms, which can be numerous and add overhead to n-gram creation even though they are seldom used.
A way to determine whether the eventual “next word” model is any good is to first ensure that the texts used for searching and matching are clean, concise, and manageable in size.
An efficient way to store an n-gram model is to save it as single-column plain text files with no header (column names) and no row names. This way, loading them into memory as data tables and converting them to vectors is fast and easy.
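As an illustration of the last point, the sketch below reloads some of the saved files; it assumes the file-name constants defined in the Source code section and is not part of the processing script itself.
# Reload a saved n-gram file as a character vector (one n-gram per row) and rebuild
# the named term frequency vector from its two companion files
bigram <- read.table(BIGRAM_FNAME, header = FALSE, quote = "\"",
                     stringsAsFactors = FALSE)[[1]]

freqr        <- scan(TERM_FREQ_VECTOR_FNAME)        # frequencies were written without quotes
names(freqr) <- readLines(TERM_FREQ_NAMES_FNAME)    # term names were saved separately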
Original Corpus: structure, line count and word count per document
> # Corpus
> summary(docs_en)
Length Class Mode
en_US.blogs.txt 2 PlainTextDocument list
en_US.news.txt 2 PlainTextDocument list
en_US.twitter.txt 2 PlainTextDocument list
>
> str(docs_en)
List of 3
$ en_US.blogs.txt :List of 2
..$ content: chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â???ogodsâ???\u009d.""| __truncated__ "We love you Mr. Brown." "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together p"| __truncated__ "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all thes"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-05 02:01:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.blogs.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.news.txt :List of 2
..$ content: chr [1:77259] "He wasn't home alone, apparently." "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotiv"| __truncated__ "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center,"| __truncated__ "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. Bu"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-05 02:01:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.news.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.twitter.txt:List of 2
..$ content: chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if I don't." "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)" ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-05 02:01:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.twitter.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
>
> # BLOG text file
> summary(docs_en[[EN_BLOG]])
Length Class Mode
content 899288 -none- character
meta 7 TextDocumentMeta list
>
> summary(docs_en[[EN_BLOG]]$content)
Length Class Mode
899288 character character
>
> summary(docs_en[[EN_BLOG]]$meta)
Length Class Mode
author 0 -none- character
datetimestamp 1 POSIXlt list
description 0 -none- character
heading 0 -none- character
id 1 -none- character
language 1 -none- character
origin 0 -none- character
>
> # Calculate number of blog words
> words <- sapply(gregexpr("\\S+", docs_en[[EN_BLOG]]$content), length)
>
> head(words)
[1] 16 5 140 40 111 10
>
> # Report number of blog lines
> length(words)
[1] 899288
>
> # Report number of blog words
> sum(words)
[1] 37334441
>
> # NEWS text file
> summary(docs_en[[EN_NEWS]])
Length Class Mode
content 77259 -none- character
meta 7 TextDocumentMeta list
>
> summary(docs_en[[EN_NEWS]]$content)
Length Class Mode
77259 character character
>
> summary(docs_en[[EN_NEWS]]$meta)
Length Class Mode
author 0 -none- character
datetimestamp 1 POSIXlt list
description 0 -none- character
heading 0 -none- character
id 1 -none- character
language 1 -none- character
origin 0 -none- character
>
> # Calculate number of news words
> words <- sapply(gregexpr("\\S+", docs_en[[EN_NEWS]]$content), length)
>
> head(words)
[1] 5 29 29 85 40 38
>
> # Report number of news lines
> length(words)
[1] 77259
>
> # Report number of news words
> sum(words)
[1] 2643972
>
> # TWITTER text file
> summary(docs_en[[EN_TWITTER]])
Length Class Mode
content 2360148 -none- character
meta 7 TextDocumentMeta list
>
> summary(docs_en[[EN_TWITTER]]$content)
Length Class Mode
2360148 character character
>
> summary(docs_en[[EN_TWITTER]]$meta)
Length Class Mode
author 0 -none- character
datetimestamp 1 POSIXlt list
description 0 -none- character
heading 0 -none- character
id 1 -none- character
language 1 -none- character
origin 0 -none- character
>
> # Calculate number of twitter words
> words <- sapply(gregexpr("\\S+", docs_en[[EN_TWITTER]]$content), length)
>
> head(words)
[1] 24 19 8 20 11 14
>
> # Report number of twitter lines
> length(words)
[1] 2360148
>
> # Report number of twitter words
> sum(words)
[1] 30373792
>
Source code
# capstone_DataProcessing.R
#
# CREATION DATE
# 2016-09-01
# AUTHOR
# Herbert Barrientos
# DESCRIPTION
# Process that captures input data from various sources (blogs, news, tweets), cleans it,
# and uses it to create three n-grams (i.e., bigrams, trigrams, and quadgrams) that are
# saved to disk. This input data is also used to create a word frequency vector that is
# also saved to disk.
#
# All saved files will be used by the word-predicting algorithm.
#
# DATA PROCESSING DECISIONS
# 1. Words are converted to lower case. Although capitalized proper names will be lost,
# there are many more portions of the texts that contain unnecessary capitalized words.
#
# 2. All characters not forming part of FILTER_PATTERN_01 are removed. The blank space is
# included in the pattern just so that it is ignored. The single quote, usually located
# in between words (e.g., "don't" and "they've") is left in the texts. This way, these
# words can be processed later, that is, once the rest of the non-pattern characters
# have been removed.
#
# 3. All word portions matching FILTER_PATTERN_02 are removed. A first attempt at keeping
# words like "don't", and "you'll" was made and actually worked fine on the Corpus text
# files. However, when creating the n-grams the NGramTokenizer() function removed the
# single quote characters anyway, leaving such words as "don t", and "you ll". Thus, the
# resulting n-grams contained garbage text (e.g., bigrams like "you ll" and "ll see").
#
# 4. Upon completion of steps 1 - 3, any remaining punctuation is completely removed.
#
# 5. All word portions matching FILTER_PATTERN_03 are removed. These are usually truncated
# forms of the verbs "to do", "to be", and "to have".
#
# 6. FILTER_PATTERN_04: Numbers are removed only when they appear as single words (e.g., 2000).
# Numbers embedded within alphabetical characters are left (e.g., r2d2, c3po, u2) as they
# might be used in certain communication contexts.
#
# 7. Stop words are not removed as they may be necessary to complete a given phrase.
#
# 8. Profanity words are not removed (for now...).
#
# 9. All single-character words matching SINGLE_CHARACTER_PATTERN (i.e., every single
# character except "a", "i", and "I") are removed.
#
# 10. REPEAT_CHARACTER_PATTERN: All repeating character words, such as "zzz", or "aaaaaaaaaa"
# are removed.
#
# 11. White space is stripped so that there is only one space between words.
#
# 12. Text samples. The idea is that, given that this project is to build a model (as opposed
# to a production application), the purpose is to test the word prediction algorithm with
# a subset of the data provided. Thus, text samples of TERM_SAMPLE_SIZE elements were
# obtained from the Corpus files.
#
# 13. New Corpus. The clean text sample is saved to disk, and used to create a new Corpus.
#
# 14. Term frequency vector. Created from the newly created Corpus. This way, both this vector
# and the n-grams share the same data.
#
# 15. Memory issues. To handle memory issues and prolonged processing time (some runs took
# approximately 24 hs. using the entire dataset), the aforementioned text samples were
# created. Additionally, data structures are deleted from memory as soon as they are no
# longer necessary, and the garbage collection function is executed immediately after.
#
# HISTORY REVISION
#
# #######################
# ###### LIBRARIES ######
# #######################
library(clue)
library(tm)
library(RWeka)
library(ggplot2)
# #######################
# ###### CONSTANTS ######
# #######################
# Set Corpus document indices
EN_BLOG <- 1
EN_NEWS <- 2
EN_TWITTER <- 3
# Set textual patterns
BLANK_SPACE <- " "
TERM_SAMPLE_SIZE <- 30000
FILTER_PATTERN_01 <- "[^a-zA-Z 0-9']"
FILTER_PATTERN_02 <- "'[a-zA-Z]*(?!\\B|-)\\s"
FILTER_PATTERN_03 <- "( don | doesn | wasn | weren | re | ve | m | s | t )"
FILTER_PATTERN_04 <- "\\s*(?<!\\B|-)\\d+(?!\\B|-)\\s*"
SINGLE_CHARACTER_PATTERN <- "^([^aiI] )|( [^aiI] )|( [^aiI])$"
REPEAT_CHARACTER_PATTERN <- " (.)\\1{1,} "
# #######################
# ###### FUNCTIONS ######
# #######################
# Create the toSpace content transformer
toSpace <- content_transformer(
function(x, pattern) { return (gsub(pattern, BLANK_SPACE, x, perl = TRUE)) } )
# Abstract the NGramTokenizer function
ngramTokenizer <- function(x, n, m) NGramTokenizer(x, Weka_control(min = n, max = m))
# Create interfaces for bigrams and trigrams
bigramTokenizer <- function(x) ngramTokenizer(x, 2, 2)
trigramTokenizer <- function(x) ngramTokenizer(x, 3, 3)
quadgramTokenizer <- function(x) ngramTokenizer(x, 4, 4)
# #######################
# ##### DIRECTORIES #####
# #######################
# Set directory of text files in EN
wd <- paste0(getwd(), "/capstone")
dir_en <- paste0(wd, "/final/en_US/")
dir_en_processed <- paste0(wd, "/final/en_US_processed/")
word_predictor <- paste0(wd, "/wordpredictor/")
# #############################
# ##### OUTPUT FILE NAMES #####
# #############################
BIGRAM_FNAME <- paste0(word_predictor, "bigram.txt")
TRIGRAM_FNAME <- paste0(word_predictor, "trigram.txt")
QUADGRAM_FNAME <- paste0(word_predictor, "quadgram.txt")
PROCESSED_CORPUS_FNAME <- paste0(dir_en_processed, "processed_corpus.txt")
TERM_FREQ_VECTOR_FNAME <- paste0(dir_en_processed, "term_freq_vector.txt")
TERM_FREQ_NAMES_FNAME <- paste0(dir_en_processed, "term_freq_names.txt")
TERM_FREQ_ORDERED_VECTOR_FNAME <- paste0(dir_en_processed, "term_freq_ord_vector.txt")
# #######################
# ### CORPUS CREATION ###
# #######################
# Create the Corpus for text files in EN
cname_en <- dir_en
docs_en <- Corpus(DirSource(cname_en), readerControl = list(reader = readPlain))
# Inspect the Corpus
str(docs_en)
summary(docs_en)
# #######################
# #### DATA CLEANING ####
# #######################
# Set all Corpus text to lower case
docs_en <- tm_map(docs_en,content_transformer(tolower))
str(docs_en)
# Replace all characters not forming part of FILTER_PATTERN_01 with a blank space
docs_en <- tm_map(docs_en, toSpace, FILTER_PATTERN_01)
str(docs_en)
# Replace all character sequences matching FILTER_PATTERN_02 with a blank space
docs_en <- tm_map(docs_en, toSpace, FILTER_PATTERN_02)
str(docs_en)
# Remove any remaining punctuation
docs_en <- tm_map(docs_en, removePunctuation)
str(docs_en)
# Remove truncated forms of the verbs "to do", "to be", and "to have"
docs_en <- tm_map(docs_en, toSpace, FILTER_PATTERN_03)
str(docs_en)
# Replace all numbers appearing as words (FILTER_PATTERN_04) with a blank space
docs_en <- tm_map(docs_en, toSpace, FILTER_PATTERN_04)
str(docs_en)
# Replace all single character words matching SINGLE_CHARACTER_PATTERN with a blank space
docs_en <- tm_map(docs_en, toSpace, SINGLE_CHARACTER_PATTERN)
str(docs_en)
# Replace all repeating character words with a blank space
docs_en <- tm_map(docs_en, toSpace, REPEAT_CHARACTER_PATTERN)
str(docs_en)
# Strip white space
docs_en <- tm_map(docs_en, stripWhitespace)
str(docs_en)
# ###############################
# #### N-GRAM MODEL CREATION ####
# ###############################
# Make the sampling reproducible
set.seed(1234)
# Get a sample from the BLOG text file in the Corpus
t1 <- sample(docs_en[[EN_BLOG]]$content, TERM_SAMPLE_SIZE)
# Get a sample from the NEWS text file in the Corpus
t2 <- sample(docs_en[[EN_NEWS]]$content, TERM_SAMPLE_SIZE)
# Get a sample from the TWITTER text file in the Corpus
t3 <- sample(docs_en[[EN_TWITTER]]$content, TERM_SAMPLE_SIZE)
# Join all samples into one vector and free up memory
tsamples <- c(t1, t2, t3)
rm(t1); rm(t2); rm(t3); gc()
# Create a bigram, remove repetitions, save it to disk, and free up memory
bigram <- bigramTokenizer(tsamples)
bigram <- unique(bigram)
write.table(bigram, BIGRAM_FNAME, row.names = FALSE, col.names = FALSE)
rm(bigram); gc()
# Create a trigram, remove repetitions, save it to disk, and free up memory
trigram <- trigramTokenizer(tsamples)
trigram <- unique(trigram)
write.table(trigram, TRIGRAM_FNAME, row.names = FALSE, col.names = FALSE)
rm(trigram); gc()
# Create a quadgram, remove repetitions, save it to disk, and free up memory
quadgram <- quadgramTokenizer(tsamples)
quadgram <- unique(quadgram)
write.table(quadgram, QUADGRAM_FNAME, row.names = FALSE, col.names = FALSE)
rm(quadgram); gc()
# Save tsamples in order to create a new, refined Corpus
write.table(tsamples, PROCESSED_CORPUS_FNAME, row.names = FALSE, col.names = FALSE, quote = FALSE)
# Free up memory
rm(tsamples); rm(docs_en); gc()
# ########################################
# #### TERM FREQUENCY VECTOR CREATION ####
# ########################################
# Create a new Corpus with clean data for text files in EN
cname_en <- dir_en_processed
docs_en <- Corpus(DirSource(cname_en), readerControl = list(reader = readPlain))
# Inspect the Corpus
str(docs_en)
summary(docs_en)
# Create the Document Term Matrix (DTM)
dtm <- DocumentTermMatrix(docs_en)
# Inspect the DTM
dtm
# Free up memory by removing the Corpus. Not needed anymore
rm(docs_en); gc()
# IMPORTANT: Remove the least frequent terms. There are many sparse terms
# (mostly with frequencies between 1 - 9) that are useless
dtm_withoutSparseTerms <- removeSparseTerms(dtm, 0.34)
# Calculate the cumulative frequencies of words across documents [FREQUENCY VECTOR]
freqr <- colSums(as.matrix(dtm_withoutSparseTerms))
# Free up memory
rm(dtm); rm(dtm_withoutSparseTerms); gc()
# Create sort vector in descending order of frequency
ordr <- order(freqr, decreasing=TRUE)
# Write vectors to disk
freqr_names <- names(freqr)
write.table(freqr, TERM_FREQ_VECTOR_FNAME, row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(freqr_names, TERM_FREQ_NAMES_FNAME, row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(ordr, TERM_FREQ_ORDERED_VECTOR_FNAME, row.names = FALSE, col.names = FALSE, quote = FALSE)
Data transformations: first and final states
> # Set all Corpus text to lower case
> docs_en <- tm_map(docs_en,content_transformer(tolower))
> str(docs_en)
List of 3
$ en_US.blogs.txt :List of 2
..$ content: chr [1:899288] "in the years thereafter, most of the oil fields and platforms were named after pagan â???ogodsâ???\u009d.""| __truncated__ "we love you mr. brown." "chad has been awesome with the kids and holding down the fort while i work later than usual! the kids have been busy together p"| __truncated__ "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all thes"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.blogs.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.news.txt :List of 2
..$ content: chr [1:77259] "he wasn't home alone, apparently." "the st. louis plant had to close. it would die of old age. workers had been making cars there since the onset of mass automotiv"| __truncated__ "wsu's plans quickly became a hot topic on local online sites. though most people applauded plans for the new biomedical center,"| __truncated__ "the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works. bu"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.news.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.twitter.txt:List of 2
..$ content: chr [1:2360148] "how are you? btw thanks for the rt. you gonna be in dc anytime soon? love to see you. been way, way too long." "when you meet someone special... you'll know. your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if i don't." "so tired d; played lazer tag & ran a lot d; ughh going to sleep like in 5 minutes ;)" ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.twitter.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
>
> # Strip white space
> docs_en <- tm_map(docs_en, stripWhitespace)
> str(docs_en)
List of 3
$ en_US.blogs.txt :List of 2
..$ content: chr [1:899288] "in the years thereafter most of the oil fields and platforms were named after pagan gods " "we love you mr brown " "chad has been awesome with the kids and holding down the fort while i work later than usual the kids have been busy together pl"| __truncated__ "so anyways i am going to share some home decor inspiration that i have been storing in my folder on the puter i have all these "| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.blogs.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.news.txt :List of 2
..$ content: chr [1:77259] "he home alone apparently " "the st louis plant had to close it would die of old age workers had been making cars there since the onset of mass automotive p"| __truncated__ "wsu plans quickly became a hot topic on local online sites though most people applauded plans for the new biomedical center man"| __truncated__ "the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works but"| __truncated__ ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.news.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ en_US.twitter.txt:List of 2
..$ content: chr [1:2360148] "how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long " "when you meet someone special you know your heart will beat more rapidly and you smile for no reason " "they decided its more fun if i " "so tired played lazer tag ran a lot ughh going to sleep like in minutes " ...
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2016-09-04 18:05:32"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "en_US.twitter.txt"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
>
Results from the Document Term Matrix (DTM)
> # Find the most frequent words
> freqr[head(ordr, 50)]
the and that for you with was this have but are not from
132969 69109 30493 27716 22005 19369 17509 14333 13914 12985 12563 10715 10603
they said his all will one can about out has what who just
9999 8962 8935 8446 8265 8016 7798 7653 7398 7330 7294 6921 6913
when there more had like your her she their time would some been
6868 6582 6492 6460 6389 6300 6284 6262 5961 5800 5505 5228 5106
were get new our them which how now people year also
5094 5089 4768 4739 4513 4357 4229 4190 4104 4006 3903
>
> # Find the least frequent words
> freqr[tail(ordr, 50)]
zold zomb zombaphobe zombified zombocalypse zon zona
1 1 1 1 1 1 1
zonexpress zonked zonnig zontini zookster zoolander zoomba
1 1 1 1 1 1 1
zoonooz zopatti zori zorione zoroaster zoth zour
1 1 1 1 1 1 1
zoutendijk zsef zswagger zte zuccotti zuck zuckerman
1 1 1 1 1 1 1
zuey zug zugibe zukin zukini zulaikha zululand
1 1 1 1 1 1 1
zumaya zumbatomic zupan zurawik zutter zuzu zuzz
1 1 1 1 1 1 1
zwart zwilling zwolinski zygi zygote zyhylij zylstra
1 1 1 1 1 1 1
zylxxxyx
1
>
# Plot a histogram for the most frequent terms
wf <- data.frame(term=names(freqr),occurrences=freqr)
p <- ggplot(subset(wf, freqr>10000), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
>
Fig 1. Frequent terms (more than 10,000 occurrences)
wf <- data.frame(term=names(freqr),occurrences=freqr)
p <- ggplot(subset(wf, freqr>=5000 & freqr<=9999), aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
Fig 2. Frequent terms (between 5,000 and 9,999 occurrences)
Resulting n-gram samples
BIGRAMS
"i know"
"know maybe"
"maybe they"
"they getting"
"getting too"
"too much"
"much sun"
"sun i"
"i think"
"think i"
"i going"
"going to"
"to cut"
"cut them"
"them way"
"way back"
"back i"
"i replied"
"the reason"
"reason could"
"could be"
"be anything"
"anything maybe"
"maybe you"
"you violated"
"violated some"
"some arcane"
"arcane meaningless"
"meaningless regulation"
"regulation among"
"among the"
"the hundreds"
"hundreds of"
"of thousands"
"thousands of"
"of pages"
"pages of"
"of us"
"us code"
"code ignorance"
"ignorance of"
...
TRIGRAMS
"i know maybe"
"know maybe they"
"maybe they getting"
"they getting too"
"getting too much"
"too much sun"
"much sun i"
"sun i think"
"i think i"
"think i going"
"i going to"
"going to cut"
"to cut them"
"cut them way"
"them way back"
"way back i"
"back i replied"
"the reason could"
"reason could be"
"could be anything"
"be anything maybe"
"anything maybe you"
"maybe you violated"
"you violated some"
"violated some arcane"
"some arcane meaningless"
"arcane meaningless regulation"
"meaningless regulation among"
"regulation among the"
"among the hundreds"
"the hundreds of"
"hundreds of thousands"
"of thousands of"
"thousands of pages"
"of pages of"
"pages of us"
"of us code"
"us code ignorance"
"code ignorance of"
...
QUADGRAMS
"i know maybe they"
"know maybe they getting"
"maybe they getting too"
"they getting too much"
"getting too much sun"
"too much sun i"
"much sun i think"
"sun i think i"
"i think i going"
"think i going to"
"i going to cut"
"going to cut them"
"to cut them way"
"cut them way back"
"them way back i"
"way back i replied"
"the reason could be"
"reason could be anything"
"could be anything maybe"
"be anything maybe you"
"anything maybe you violated"
"maybe you violated some"
"you violated some arcane"
"violated some arcane meaningless"
"some arcane meaningless regulation"
"arcane meaningless regulation among"
"meaningless regulation among the"
"regulation among the hundreds"
"among the hundreds of"
"the hundreds of thousands"
"hundreds of thousands of"
"of thousands of pages"
"thousands of pages of"
"of pages of us"
"pages of us code"
"of us code ignorance"
"us code ignorance of"
"code ignorance of the"
...
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_2.1.0 RWeka_0.4-29 tm_0.6-2 NLP_0.1-9 clue_0.3-51
loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 digest_0.6.10 slam_0.1-38 grid_3.3.1 plyr_1.8.4
[6] gtable_0.2.0 magrittr_1.5 evaluate_0.9 scales_0.4.0 stringi_1.1.1
[11] RWekajars_3.9.0-1 rmarkdown_1.0 tools_3.3.1 stringr_1.1.0 munsell_0.4.3
[16] rsconnect_0.4.3 yaml_2.1.13 parallel_3.3.1 colorspace_1.2-6 cluster_2.0.4
[21] htmltools_0.3.5 rJava_0.9-8
>