One of the most exciting emerging fields in data science is Natural Language Processing (NLP): the ability of computers to interpret the meaning of natural human language. The applications for NLP are numerous, and with extensive research in the field they are becoming broader and more sophisticated all the time. From search engines to digital assistants (think Siri and Google Now) to online form auto-completion, NLP is used far and wide.
We will be building a text prediction algorithm based on a body of text (corpus) which can be downloaded at this link: Corpus. Our motivation is that business and personal communications alike continue to move onto mobile platforms, which by definition have limited space for a keyboard. Having extensive conversations on a keyboard barely large enough to type on can become cumbersome. We hope to create an app that accurately predicts the next word in a sentence based on the previous few words, which could greatly reduce the number of keystrokes needed to write a sentence.
Our corpus contains three files, each containing a body of text from a different media source: blogs, news articles, and Twitter. Below is a brief summary of the three text files we will be working with.
| | Twitter File | News File | Blog File |
|---|---|---|---|
| File Size (MB) | 159.4 | 196.3 | 200.4 |
| Words | 30,373,543 | 2,643,969 | 37,334,131 |
| Lines | 2,360,148 | 77,259 | 899,288 |
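For reference, figures like those in the table above can be computed along the following lines. This is a sketch only: it assumes the three files sit in the working directory and uses the stringi package for word counts, which is an assumption and not part of the original report.

# Sketch (assumption): compute file size, word count and line count for each source
library(stringi)
files <- c(Twitter = "en_US.twitter.txt", News = "en_US.news.txt", Blogs = "en_US.blogs.txt")
file.summary <- do.call(rbind, lapply(files, function(f) {
    lines <- readLines(f, skipNul = TRUE)
    data.frame(FileSizeMB = round(file.info(f)$size / 1024^2, 1),
               Words = sum(stri_count_words(lines)),
               Lines = length(lines))
}))
file.summary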
The R package tm provides a framework for working with bodies of text for NLP. The code chunk below loads each file and takes a 5% sample of it.
library(tm)    # Text-mining framework used to build and transform the corpus
library(RWeka) # Provides the NGramTokenizer used later for n-gram tokenizing
set.seed(12) # Setting seed to ensure reproducibility of sampling
twtr <- readLines("en_US.twitter.txt")
blog <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
blog.sampled <- sample(blog, length(blog)/20)
twtr.sampled <- sample(twtr, length(twtr)/20)
news.sampled <- sample(news, length(news)/20)
Next we create our corpus object and apply common transformations to make the text more uniform, allowing for more accurate modeling down the road.
# Write the sampled text to its own directory and create a Corpus-type object from it
# using the tm library (DirSource expects a directory, and we want to read the samples
# rather than the full files)
dir.create("sample", showWarnings = FALSE)
writeLines(blog.sampled, "sample/en_US.blogs.sampled.txt")
writeLines(news.sampled, "sample/en_US.news.sampled.txt")
writeLines(twtr.sampled, "sample/en_US.twitter.sampled.txt")
corpus <- Corpus(DirSource("sample"))
# Transformations applied to body of corpus. We do these to make our corpus more uniform
# and cleaner for tokenizing purposes. I am also replacing a number of common symbols with
# spaces. These symbols are often used to denote things like websites, email addresses,
# word choice alternatives (e.g. his/her), etc.
# Generic function to replace a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Use above toSpace function to replace /, @, | and # with spaces
corpus <- tm_map(corpus, toSpace, "/|@|\\||#")
# Change all letters to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove common english stopwords that will throw off predictions
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Delete extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove digits
corpus <- tm_map(corpus, removeNumbers)
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
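As a quick, illustrative sanity check (not part of the original report), we can peek at the first couple hundred characters of the first cleaned document to confirm the transformations took effect:

# Illustrative check (assumption): preview the start of the first cleaned document
strtrim(paste(as.character(corpus[[1]]), collapse = " "), 200)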
A Term Document Matrix (TDM) is simply a matrix in which documents are the columns, terms are the rows, and each cell holds the count of a term in a document. We are going to create and manipulate several of these in order to get a feel for the most common short phrases across our corpus. We will use the R package RWeka to focus on n-grams of varying lengths, not just single terms (unigrams, to keep with the terminology).
# Create n-gram tokenizer functions to utilize in the Term Document Matrices; we use
# the NGramTokenizer function from the RWeka package loaded earlier
UniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BiTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
# Create 4 Term Document Matrices (TDMs), one for each n-gram of interest. Technically
# the Unigram function doesn't need a control modifier as the function defaults to
# 1-grams. I created a UniTokenizer function which is utilized below for demonstrative
# purposes
options(mc.cores=1)
TdmUni <- TermDocumentMatrix(corpus, control = list(tokenize = UniTokenizer))
TdmBi <- TermDocumentMatrix(corpus, control = list(tokenize = BiTokenizer))
TdmTri <- TermDocumentMatrix(corpus, control = list(tokenize = TriTokenizer))
TdmQuad <- TermDocumentMatrix(corpus, control = list(tokenize = QuadTokenizer))
Now that we have our TDMs we can look at the most common n-grams across our entire sample text.
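The wordcloud calls below draw on frequency data frames (df.freq.uni, df.freq.bi, df.freq.tri, df.freq.quad) with Term and Frequency columns, which the original code does not show being built. A minimal sketch of that step, assuming the helper name tdmToFreq and using slam::row_sums (slam is a dependency of tm):

# Assumed helper (not shown above): collapse a TDM into a data frame of terms and
# their total counts across all documents, sorted from most to least frequent
tdmToFreq <- function(tdm) {
    freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
    data.frame(Term = names(freq), Frequency = unname(freq), stringsAsFactors = FALSE)
}
df.freq.uni  <- tdmToFreq(TdmUni)
df.freq.bi   <- tdmToFreq(TdmBi)
df.freq.tri  <- tdmToFreq(TdmTri)
df.freq.quad <- tdmToFreq(TdmQuad)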
library(wordcloud) ; library(ggplot2) ; library(RColorBrewer)
par(mfcol = c(2,2))
wordcloud(df.freq.uni$Term[1:20], df.freq.uni$Frequency[1:20],
colors = brewer.pal(6, "Set3"), scale = c(2, .25))
wordcloud(df.freq.bi$Term[1:20], df.freq.bi$Frequency[1:20],
colors = brewer.pal(6, "Dark2"), scale = c(2, .25))
wordcloud(df.freq.tri$Term[1:20], df.freq.tri$Frequency[1:20],
colors = brewer.pal(6, "Paired"), scale = c(2, .25))
wordcloud(df.freq.quad$Term[1:20], df.freq.quad$Frequency[1:20],
colors = brewer.pal(6, "Set1"), scale = c(2, .25))
Moving forward, we will use well-established modeling and prediction techniques to create an algorithm that can predict the next word in a sentence. We now have a basic understanding of the corpus of text we are working with, and many things to consider as we build the model.